faster setkeyv if keys already exist. #2332

MarkusBonsch · 2017-09-06T06:25:58Z

Closes #2331.

A small benchmark highlights the benefit of the new implementation (code at the end of the comment):

DT is a data.table with 1e7 rows. x1 and x2 are columns with 10 groups each (1:10), x3 is numeric random numbers.
Time is median time in milliseconds of 100 iterations

		master	pull request
1	no existing key ; setkey(DT, x1, x2)	441	445
2	existing key on x1 ; setkey(DT, x1, x2)	284	286
3	existing key on x1 and x2 ; setkey(DT, x1, x2)	111	49
4	existing key on x1, x2, and x3 ; setkey(DT, x1, x2)	110	49

Cases 3 and 4 see a significant speed improvement since forderv is not called.

The only drawback is that the internal check that the data.table is really sorted by the key is not performed anymore. This has two implications that are absolutely acceptable from my perspective:

The internal logic of data.table is not tested in each call of setkeyv. I believe testing the sanity of the internal logic is a task for test.data.table(), and should not hamper the speed of a function that is called frequently.
If users 'go under the hood' and manipulate the sorted attribute manually, it is not guaranteed that the key is still correct. I believe that users that go under the hood should be expected to know what they do. No need to maintain a costly check for this very specific case.

Of course, I had to remove the test that checked this warning.

Here is the code for my benchmark:

library(data.table)
library(microbenchmark)

## timing for setkey
set.seed(100)
nrow <- 1e7
dat_nokey <- data.table(x1 = sample(1:10,nrow, replace = TRUE), x2 = sample(1:10, replace = TRUE), x3 = rnorm(nrow))

dat_onekey <- copy(dat_nokey)
setkey(dat_onekey, x1, verbose = TRUE)

dat_twokey <- copy(dat_nokey)
setkey(dat_twokey, x1, x2, verbose = TRUE)

dat_threekey <- copy(dat_nokey)
setkey(dat_threekey, x1, x2, x3, verbose = TRUE)

timing_setkey <- microbenchmark(nokey =    setkey(copy(dat_nokey), x1, x2),
                                onekey =   setkey(copy(dat_onekey), x1, x2),
                                twokey =   setkey(copy(dat_twokey), x1, x2),
                                threekey = setkey(copy(dat_threekey), x1, x2),
                                times = 100)

## test that all results are identical (except for threekey which is expected to be different)
nokey    = setkey(copy(dat_nokey), x1, x2)
onekey   = setkey(copy(dat_onekey), x1, x2, verbose = TRUE)
twokey   = setkey(copy(dat_twokey), x1, x2)
threekey = setkey(copy(dat_threekey), x1, x2)

if(!identical(nokey, onekey)) stop("Error")
if(!identical(nokey, twokey)) stop("Error")

codecov-io · 2017-09-06T06:41:14Z

Codecov Report

Merging #2332 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2332      +/-   ##
==========================================
+ Coverage   90.92%   90.96%   +0.03%     
==========================================
  Files          61       61              
  Lines       11804    11808       +4     
==========================================
+ Hits        10733    10741       +8     
+ Misses       1071     1067       -4

Impacted Files	Coverage Δ
R/setkey.R	`93.6% <100%> (+0.09%)`	⬆️
src/forder.c	`94.47% <0%> (+0.52%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c98142c...96019fa. Read the comment docs.

… into respectKey

mattdowle · 2017-09-06T18:37:22Z

Well spotted! Thanks for the benchmarks, explaining why test 347 was removed up front and including the news item. That all made it easy to accept.

setkeyv is much faster now, if keys already exist.

90bd521

NORDEXAG\BonschM added 2 commits September 6, 2017 10:20

setkeyv is much faster now, if keys already exist.

6624d54

Merge branch 'respectKey' of https:/MarkusBonsch/data.table…

96019fa

… into respectKey

mattdowle added this to the v1.10.6 milestone Sep 6, 2017

mattdowle merged commit 991f90e into Rdatatable:master Sep 6, 2017

mattdowle mentioned this pull request Sep 6, 2017

[Request] setkeyv can be accelerated if key already exists #2331

Closed

MarkusBonsch deleted the respectKey branch September 6, 2017 19:02

mattdowle mentioned this pull request Sep 6, 2017

Add is.valid(DT) function #2334

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

faster setkeyv if keys already exist. #2332

faster setkeyv if keys already exist. #2332

MarkusBonsch commented Sep 6, 2017 •

edited by mattdowle

Loading

codecov-io commented Sep 6, 2017 •

edited

Loading

mattdowle commented Sep 6, 2017

faster setkeyv if keys already exist. #2332

faster setkeyv if keys already exist. #2332

Conversation

MarkusBonsch commented Sep 6, 2017 • edited by mattdowle Loading

codecov-io commented Sep 6, 2017 • edited Loading

Codecov Report

mattdowle commented Sep 6, 2017

MarkusBonsch commented Sep 6, 2017 •

edited by mattdowle

Loading

codecov-io commented Sep 6, 2017 •

edited

Loading