"by = colA:colD" produces incorrect result when key = ("colA","colD") #4285
Comments
Thanks for the report. I can confirm the bug. Lines 749, 752, and 753 are the offending lines (lines 749 to 753 in b1b1832).
PR #4248 will close this.
Here's a simple reproducible example that gets to the point quickly. Let's start with a very simple data.table:
library(data.table)
DT <- data.table(col1 = c(1, 1, 1), col2 = c("a", "b", "a"), col3 = c("A", "B", "A"), col4 = c(2, 2, 2))
print(DT)
Note that rows 1 & 3 are identical, with a differing row (2) between them. This "interrupting" row is key to the bug that follows.
If we group by columns 1 through 4 using two different syntaxes, we get the same (correct) result:
DT[, .N, by = c("col1", "col2", "col3", "col4")]
DT[, .N, by = col1:col4]
Now, let's set a key, using columns 1 & 4, and re-run the above grouping commands:
setkey(DT, col1, col4)
key(DT)
DT[, .N, by = c("col1", "col2", "col3", "col4")]
DT[, .N, by = col1:col4]
Notice that "by = col1:col4" now produces a different result.
Removing the key, or setting some key other than ("col1", "col4"), restores the correct results for both syntaxes. (Not shown)
It's as though the presence of the key ("col1", "col4") induces the "by = col1:col4" syntax to assume that the data.table is already sorted by (col1, col2, col3, col4). As a result, the intervening row (2) causes the grouping to miss the later matching row (3).
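The suspected failure mode can be illustrated without data.table at all. The sketch below is base R only and is not data.table's actual internal code; it simply contrasts order-independent grouping with grouping by consecutive runs, which is only valid when the rows are already sorted by every grouping column. The `df` copy of the example data and the `labels` vector are introduced here purely for illustration.

```r
# Toy data mirroring the example above (base R only, no data.table needed).
df <- data.frame(
  col1 = c(1, 1, 1),
  col2 = c("a", "b", "a"),
  col3 = c("A", "B", "A"),
  col4 = c(2, 2, 2)
)

# One label per row, built from all four grouping columns.
labels <- paste(df$col1, df$col2, df$col3, df$col4, sep = "|")

# Order-independent grouping: one count per distinct label.
true_counts <- table(labels)

# Grouping that assumes sorted input: one count per consecutive run.
run_counts <- rle(labels)

print(true_counts)  # two groups, with counts 2 and 1
print(run_counts)   # three runs, each of length 1
```

If the keyed code path treats the rows as already ordered by all four columns, it would behave like the run-based version and split rows 1 and 3 into separate groups, which is consistent with the behaviour described above.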
So far, I have noticed this bug in only one case: when the key is ("colB", "colG") and the same two columns are named as endpoints in the by ":" syntax ("by = colB:colG").
FWIW, today is my first time ever using GitHub, so please forgive me if I've missed something. (I joined today so that I could report what I noticed.) I searched the NEWS, the development version, open issues, and Stack Overflow, but I found nothing similar. Perhaps I don't know the correct search terms, or perhaps this is an edge case.
As a mitigation for now, I've resorted to using only the by = c("colA", "colB", ...) syntax. The colB:colG syntax is very convenient for ad-hoc analysis, which is a good share of my daily work.
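One possible way to keep some of the range convenience while staying on the character-vector form of `by` is a small helper that expands two endpoint names into the full list of column names. The helper name `cols_between` is hypothetical and not part of data.table; this is a minimal sketch in base R, shown on a plain data.frame.

```r
# Hypothetical helper (not part of data.table): expand a "from:to" column
# range into an explicit character vector of column names.
cols_between <- function(x, from, to) {
  idx <- match(c(from, to), names(x))
  if (anyNA(idx)) {
    stop("unknown column name: ", paste(c(from, to)[is.na(idx)], collapse = ", "))
  }
  names(x)[idx[1]:idx[2]]
}

# Works on anything with names(); shown here on a plain data.frame.
df <- data.frame(col1 = 1, col2 = "a", col3 = "A", col4 = 2)
by_cols <- cols_between(df, "col1", "col4")
print(by_cols)
```

With a data.table, the resulting vector could then be supplied as `by = by_cols`, in place of the explicit c("col1", "col2", "col3", "col4"). Whether this avoids the bug in the affected version has not been verified here; it is assumed to behave like the explicit character-vector form reported as working above.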
sessionInfo()