Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster setkeyv if keys already exist. #2332

Merged
merged 3 commits into from
Sep 6, 2017

Conversation

MarkusBonsch
Copy link
Contributor

@MarkusBonsch MarkusBonsch commented Sep 6, 2017

Closes #2331.

A small benchmark highlights the benefit of the new implementation (code at the end of the comment):

DT is a data.table with 1e7 rows. x1 and x2 are columns with 10 groups each (1:10), x3 is numeric random numbers.
Time is median time in milliseconds of 100 iterations

master pull request
1 no existing key ; setkey(DT, x1, x2) 441 445
2 existing key on x1 ; setkey(DT, x1, x2) 284 286
3 existing key on x1 and x2 ; setkey(DT, x1, x2) 111 49
4 existing key on x1, x2, and x3 ; setkey(DT, x1, x2) 110 49

Cases 3 and 4 see a significant speed improvement since forderv is not called.

The only drawback is that the internal check that the data.table is really sorted by the key is not performed anymore. This has two implications that are absolutely acceptable from my perspective:

  1. The internal logic of data.table is not tested in each call of setkeyv. I believe testing the sanity of the internal logic is a task for test.data.table(), and should not hamper the speed of a function that is called frequently.
  2. If users 'go under the hood' and manipulate the sorted attribute manually, it is not guaranteed that the key is still correct. I believe that users that go under the hood should be expected to know what they do. No need to maintain a costly check for this very specific case.

Of course, I had to remove the test that checked this warning.

Here is the code for my benchmark:

library(data.table)
library(microbenchmark)

## timing for setkey
set.seed(100)
nrow <- 1e7
dat_nokey <- data.table(x1 = sample(1:10,nrow, replace = TRUE), x2 = sample(1:10, replace = TRUE), x3 = rnorm(nrow))

dat_onekey <- copy(dat_nokey)
setkey(dat_onekey, x1, verbose = TRUE)

dat_twokey <- copy(dat_nokey)
setkey(dat_twokey, x1, x2, verbose = TRUE)

dat_threekey <- copy(dat_nokey)
setkey(dat_threekey, x1, x2, x3, verbose = TRUE)

timing_setkey <- microbenchmark(nokey =    setkey(copy(dat_nokey), x1, x2),
                                onekey =   setkey(copy(dat_onekey), x1, x2),
                                twokey =   setkey(copy(dat_twokey), x1, x2),
                                threekey = setkey(copy(dat_threekey), x1, x2),
                                times = 100)

## test that all results are identical (except for threekey which is expected to be different)
nokey    = setkey(copy(dat_nokey), x1, x2)
onekey   = setkey(copy(dat_onekey), x1, x2, verbose = TRUE)
twokey   = setkey(copy(dat_twokey), x1, x2)
threekey = setkey(copy(dat_threekey), x1, x2)

if(!identical(nokey, onekey)) stop("Error")
if(!identical(nokey, twokey)) stop("Error")

@codecov-io
Copy link

codecov-io commented Sep 6, 2017

Codecov Report

Merging #2332 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2332      +/-   ##
==========================================
+ Coverage   90.92%   90.96%   +0.03%     
==========================================
  Files          61       61              
  Lines       11804    11808       +4     
==========================================
+ Hits        10733    10741       +8     
+ Misses       1071     1067       -4
Impacted Files Coverage Δ
R/setkey.R 93.6% <100%> (+0.09%) ⬆️
src/forder.c 94.47% <0%> (+0.52%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c98142c...96019fa. Read the comment docs.

@mattdowle mattdowle added this to the v1.10.6 milestone Sep 6, 2017
@mattdowle mattdowle merged commit 991f90e into Rdatatable:master Sep 6, 2017
@mattdowle
Copy link
Member

Well spotted! Thanks for the benchmarks, explaining why test 347 was removed up front and including the news item. That all made it easy to accept.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants