Extend 2 billion row benchmarks e.g. memory usage, sorting, joining, by-reference #2

Closed
arunsrinivasan opened this issue Jun 8, 2014 · 3 comments

Comments

@arunsrinivasan
Member

We've currently gone to 2E9 rows (the 32-bit index limit) with 9 columns (100GB). See the benchmarks page on the wiki.

Ideally it would be great to compare all available tools that are either specifically developed for large in-memory data manipulation or are capable of handling data at these sizes much better than base R. Of course base R should also be included, typically as the control.

One aspect of the benchmarking should be to highlight not just run time (speed) but also memory usage. Features such as sorting/ordering by reference and sub-assignment by reference should, at this data size, show quite clearly the speed and memory gains attainable.
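
As an illustrative sketch only (not the project's benchmark script), the kinds of by-reference operations meant here could be exercised like this with data.table; the value of N, the column layout, and the chosen operations are assumptions for demonstration, and the real runs would scale N towards 2e9:

```r
library(data.table)

# Illustrative scale only; the real benchmarks push N towards 2e9 rows.
N  <- 1e7
DT <- data.table(id  = sample(1e5L, N, replace = TRUE),
                 grp = sample(letters, N, replace = TRUE),
                 val = rnorm(N))

system.time(setorder(DT, id))                         # ordering by reference (no copy)
system.time(DT[val < 0, val := 0])                    # sub-assignment by reference
system.time(DT[, .(mean_val = mean(val)), by = grp])  # grouping
gc()                                                  # rough view of memory after the operations
```

Run-time alone can be captured with system.time() as above; peak memory needs an external tool measuring the whole process, which is where something like cgmemtime comes in.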

@mattdowle
Member

For memory usage, perhaps: https://github.com/gsauthof/cgmemtime

@arunsrinivasan
Member Author

Figured cgmemtime out. Quite straightforward. Just need some time to run through grouping/joins/reshaping benchmarks.
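
For reference, a minimal sketch of what running one benchmark script under cgmemtime could look like from R; the script name bench_grouping.R is hypothetical, and this assumes cgmemtime is installed and set up as described in its README:

```r
# Run a stand-alone benchmark script under cgmemtime so that elapsed time and
# peak memory are captured together for the whole child process.
out <- system2("cgmemtime",
               args   = c("Rscript", "bench_grouping.R"),  # hypothetical benchmark script
               stdout = TRUE, stderr = TRUE)
cat(out, sep = "\n")  # cgmemtime reports timings and high-water memory use
```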

@jangorecki
Member

I would like to close this one as it is already an epic, and will remain an epic for a long time, given the broad scope defined here.
Work on benchmarks has shifted to a dedicated h2o project, https://github.com/h2oai/db-benchmark, and related issues are:
