"by = colA:colD" produces incorrect result when key = ("colA","colD") #4285

cbilot · 2020-03-07T02:46:07Z

Here's a simple reproducible example that gets to the point quickly. Let's start with a very simple data.table:
DT <- data.table( col1 = c(1, 1, 1), col2 = c("a", "b", "a"), col3 = c("A", "B", "A"), col4 = c(2, 2, 2) )
print(DT)

col1 col2 col3 col4
1: 1 a A 2
2: 1 b B 2
3: 1 a A 2

Note that rows 1 & 3 are identical, with a differing row (2) between them. This "interrupting" row is key to the bug that follows.

If we run a simple grouping using columns 1 through 4 using two different syntax, we get the same (correct) result:
DT[, .N, by = c("col1", "col2", "col3", "col4")]

col1 col2 col3 col4 N
1: 1 a A 2 2
2: 1 b B 2 1

DT[, .N, by = col1:col4]

col1 col2 col3 col4 N
1: 1 a A 2 2
2: 1 b B 2 1

Now, let's set a key, using columns 1 & 4, and re-run the above grouping commands:
setkey(DT, col1, col4)
key(DT)

[1] "col1" "col4"

DT[, .N, by = c("col1", "col2", "col3", "col4")]

col1 col2 col3 col4 N
1: 1 a A 2 2
2: 1 b B 2 1

DT[, .N, by = col1:col4]

col1 col2 col3 col4 N
1: 1 a A 2 1
2: 1 b B 2 1
3: 1 a A 2 1

Notice that the "by = col1:col4" now produces a different result.

Removing the key -- or setting some key other than ("col1", "col4") -- will restore the correct results for both syntax. (Not shown)

It's as though the presence of the key ("col1", "col4") induces the "by=col1:col4" syntax to assume that the data.table is already sorted by (col1, col2, col3, col4). And thus, the intervening row (2) causes the grouping to miss later matching row.

So far, I have noticed this bug in only one case: when the key is ("colB", "colG") and the same two columns are named as endpoints in the by ":" syntax ("by = colB:colG").

FWIW, today is my first time ever using GitHub, so please forgive if I've missed something. (I joined today so that I could report what I noticed.) I searched the NEWS, the development version, open issues, and stack overflow .. but I found nothing similar. Perhaps I don't know the correct search terms ... or perhaps this is an edge case.

As a mitigation for now, I've resorted to using only the "by = c("colA","colB", ..) syntax. The colB:colG syntax is very convenient for ad-hoc analysis, which is a good share of my daily work.

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
readODS_1.6.7 nvimcom_0.9-85 rvisidata_0.0.0.9000 data.table_1.12.8 colorout_1.2-2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 crayon_1.3.4 cellranger_1.1.0 R6_2.4.1 magrittr_1.5 pillar_1.4.3 stringi_1.4.6 rlang_0.4.5 xml2_1.2.2
[10] vctrs_0.2.3 tools_3.6.3 readr_1.3.1 hms_0.5.3 compiler_3.6.3 pkgconfig_2.0.3 tibble_2.1.3

The text was updated successfully, but these errors were encountered:

ColeMiller1 · 2020-03-07T09:38:30Z

Thanks for the report - I can confirm bug. Lines 749, 752, and 753 are the offending lines:

data.table/R/data.table.R

Lines 749 to 753 in b1b1832

 allbyvars = intersect(all.vars(bysub), names_x) 

 orderedirows = .Call(CisOrderedSubset, irows, nrow(x)) # TRUE when irows is NULL (i.e. no i clause). Similar but better than is.sorted(f__) 

 bysameorder = byindex = FALSE 

 if (all(vapply_1b(bysubl, is.name))) { 

 bysameorder = orderedirows && haskey(x) && length(allbyvars) && identical(allbyvars,head(key(x),length(allbyvars)))

PR #4248 will close this.

jangorecki added this to the 1.12.9 milestone Apr 5, 2020

ColeMiller1 self-assigned this Apr 12, 2020

ColeMiller1 mentioned this issue Apr 12, 2020

By colon key fix #4376

Merged

jangorecki modified the milestones: 1.12.11, 1.12.9 May 27, 2020

mattdowle closed this as completed in #4376 Jun 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"by = colA:colD" produces incorrect result when key = ("colA","colD") #4285

"by = colA:colD" produces incorrect result when key = ("colA","colD") #4285

cbilot commented Mar 7, 2020

ColeMiller1 commented Mar 7, 2020

"by = colA:colD" produces incorrect result when key = ("colA","colD") #4285

"by = colA:colD" produces incorrect result when key = ("colA","colD") #4285

Comments

cbilot commented Mar 7, 2020

ColeMiller1 commented Mar 7, 2020