Extend 2 billion row benchmarks e.g. memory usage, sorting, joining, by-reference #2

Closed
arunsrinivasan opened this issue Jun 8, 2014 · 3 comments

Comments

@arunsrinivasan
Member

We've currently gone to 2E9 rows (the 32-bit index limit) with 9 columns (100GB). See the benchmarks page on the wiki.

Ideally it would be great to compare all available tools that are either specifically developed for large in-memory data manipulation or are capable of handling data at these sizes much better than base R. Of course base R should also be included, typically as the control.

One aspect of the benchmarking should be to highlight not just run time (speed) but also memory usage. Features such as sorting/ordering by reference and sub-assignment by reference should, at this data size, show quite clearly the speed and memory gains attainable.
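
As an illustrative sketch only (not the project's benchmark script), the kinds of by-reference operations meant here could be exercised like this with data.table; the value of N, the column layout, and the chosen operations are assumptions for demonstration, and the real runs would scale N towards 2e9:

```r
library(data.table)

# Illustrative scale only; the real benchmarks push N towards 2e9 rows.
N  <- 1e7
DT <- data.table(id  = sample(1e5L, N, replace = TRUE),
                 grp = sample(letters, N, replace = TRUE),
                 val = rnorm(N))

system.time(setorder(DT, id))                         # ordering by reference (no copy)
system.time(DT[val < 0, val := 0])                    # sub-assignment by reference
system.time(DT[, .(mean_val = mean(val)), by = grp])  # grouping
gc()                                                  # rough view of memory after the operations
```

Run-time alone can be captured with system.time() as above; peak memory needs an external tool measuring the whole process, which is where something like cgmemtime comes in.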

@mattdowle
Member

For memory usage, perhaps: https://github.com/gsauthof/cgmemtime

@arunsrinivasan
Member Author

Figured cgmemtime out. Quite straightforward. Just need some time to run through grouping/joins/reshaping benchmarks.
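
For reference, a minimal sketch of what running one benchmark script under cgmemtime could look like from R; the script name bench_grouping.R is hypothetical, and this assumes cgmemtime is installed and set up as described in its README:

```r
# Run a stand-alone benchmark script under cgmemtime so that elapsed time and
# peak memory are captured together for the whole child process.
out <- system2("cgmemtime",
               args   = c("Rscript", "bench_grouping.R"),  # hypothetical benchmark script
               stdout = TRUE, stderr = TRUE)
cat(out, sep = "\n")  # cgmemtime reports timings and high-water memory use
```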

@jangorecki
Member

I would like to close this one as it is already an epic, and will remain an epic for a long time, given the broad scope defined here.
Work on benchmarks has shifted to a dedicated h2o project, https://github.com/h2oai/db-benchmark, and related issues are:
