Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"by = colA:colD" produces incorrect result when key = ("colA","colD") #4285

Closed
cbilot opened this issue Mar 7, 2020 · 1 comment · Fixed by #4376
Closed

"by = colA:colD" produces incorrect result when key = ("colA","colD") #4285

cbilot opened this issue Mar 7, 2020 · 1 comment · Fixed by #4376
Assignees
Milestone

Comments

@cbilot
Copy link

cbilot commented Mar 7, 2020

Here's a simple reproducible example that gets to the point quickly. Let's start with a very simple data.table:
DT <- data.table( col1 = c(1, 1, 1), col2 = c("a", "b", "a"), col3 = c("A", "B", "A"), col4 = c(2, 2, 2) )
print(DT)

col1 col2 col3 col4
1: 1 a A 2
2: 1 b B 2
3: 1 a A 2

Note that rows 1 & 3 are identical, with a differing row (2) between them. This "interrupting" row is key to the bug that follows.

If we run a simple grouping using columns 1 through 4 using two different syntax, we get the same (correct) result:
DT[, .N, by = c("col1", "col2", "col3", "col4")]

col1 col2 col3 col4 N
1: 1 a A 2 2
2: 1 b B 2 1

DT[, .N, by = col1:col4]

col1 col2 col3 col4 N
1: 1 a A 2 2
2: 1 b B 2 1

Now, let's set a key, using columns 1 & 4, and re-run the above grouping commands:
setkey(DT, col1, col4)
key(DT)

[1] "col1" "col4"

DT[, .N, by = c("col1", "col2", "col3", "col4")]

col1 col2 col3 col4 N
1: 1 a A 2 2
2: 1 b B 2 1

DT[, .N, by = col1:col4]

col1 col2 col3 col4 N
1: 1 a A 2 1
2: 1 b B 2 1
3: 1 a A 2 1

Notice that the "by = col1:col4" now produces a different result.

Removing the key -- or setting some key other than ("col1", "col4") -- will restore the correct results for both syntax. (Not shown)

It's as though the presence of the key ("col1", "col4") induces the "by=col1:col4" syntax to assume that the data.table is already sorted by (col1, col2, col3, col4). And thus, the intervening row (2) causes the grouping to miss later matching row.

So far, I have noticed this bug in only one case: when the key is ("colB", "colG") and the same two columns are named as endpoints in the by ":" syntax ("by = colB:colG").

FWIW, today is my first time ever using GitHub, so please forgive if I've missed something. (I joined today so that I could report what I noticed.) I searched the NEWS, the development version, open issues, and stack overflow .. but I found nothing similar. Perhaps I don't know the correct search terms ... or perhaps this is an edge case.

As a mitigation for now, I've resorted to using only the "by = c("colA","colB", ..) syntax. The colB:colG syntax is very convenient for ad-hoc analysis, which is a good share of my daily work.

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
readODS_1.6.7 nvimcom_0.9-85 rvisidata_0.0.0.9000 data.table_1.12.8 colorout_1.2-2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 crayon_1.3.4 cellranger_1.1.0 R6_2.4.1 magrittr_1.5 pillar_1.4.3 stringi_1.4.6 rlang_0.4.5 xml2_1.2.2
[10] vctrs_0.2.3 tools_3.6.3 readr_1.3.1 hms_0.5.3 compiler_3.6.3 pkgconfig_2.0.3 tibble_2.1.3

@ColeMiller1
Copy link
Contributor

Thanks for the report - I can confirm bug. Lines 749, 752, and 753 are the offending lines:

data.table/R/data.table.R

Lines 749 to 753 in b1b1832

allbyvars = intersect(all.vars(bysub), names_x)
orderedirows = .Call(CisOrderedSubset, irows, nrow(x)) # TRUE when irows is NULL (i.e. no i clause). Similar but better than is.sorted(f__)
bysameorder = byindex = FALSE
if (all(vapply_1b(bysubl, is.name))) {
bysameorder = orderedirows && haskey(x) && length(allbyvars) && identical(allbyvars,head(key(x),length(allbyvars)))

PR #4248 will close this.

@jangorecki jangorecki added this to the 1.12.9 milestone Apr 5, 2020
@ColeMiller1 ColeMiller1 self-assigned this Apr 12, 2020
@jangorecki jangorecki modified the milestones: 1.12.11, 1.12.9 May 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants