FR: warn/error when updating and i has duplicates #2837

Closed · MichaelChirico opened this issue May 5, 2018 · 5 comments · Fixed by #3541

@MichaelChirico (Member)

DT = data.table(a = 1:5)
DT[c(1, 1, 2), a := 3:5][]
#    a
# 1: 4
# 2: 5
# 3: 3
# 4: 4
# 5: 5

It seems the final element assigned wins (here row 1 ends up as 4, which corresponds to the second instance of 1 in i). It's not clear what the right behavior is in this case; my guess is that most often it's a user mistake, hence a warning. But it could also be an error, since the "correct" behavior is ambiguous.
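
For what it's worth, base R subassignment follows the same last-write-wins rule when the index vector has duplicates:

x = 1:5
x[c(1, 1, 2)] = 3:5
x
# [1] 4 5 3 4 5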

@st-pasha (Contributor) commented May 5, 2018

Detecting duplicates in i involves counting uniques, which has non-trivial cost when i is large. I'm not sure it's worth the effort to try to catch this exceedingly rare user error...

The current behavior is "correct" in the following sense: as we go along vector i, we assign to each index the corresponding element from j. Thus we first assign 3 into row 1, then overwrite row 1 with 4, and finally assign 5 into row 2. Unfortunately, this "correctness" will no longer hold once we use OMP to evaluate this query (presumably on much larger data). If two different threads are assigned to store different values into the same row, the end result can be either of those values (or even something "in-between", since we will not be doing atomic writes).
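
For reference, a minimal sketch of such a check using plain anyDuplicated() (illustrative only, not what data.table would do internally):

check_i = function(i) {
  d = anyDuplicated(i)  # position of the first duplicate, 0L if none
  if (d) warning(sprintf("i has duplicates (first at position %d); later writes win", d))
}
check_i(c(1, 1, 2))
# Warning: i has duplicates (first at position 2); later writes win

anyDuplicated() can stop at the first duplicate it finds, but when there are no duplicates it still has to hash all of i, which is exactly the cost in question.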

@MichaelChirico (Member, Author)

Is it not possible to use a hash table to mark each element as "seen"/"not seen" along the way?

What about an option to green-light the speed hit?

Also, in the case of a keyed subset, the cost is trivial.
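
e.g., for an already-sorted i, a single pass over adjacent elements finds duplicates with no hashing at all (a sketch, not data.table code):

has_dup_sorted = function(idx) {
  n = length(idx)
  n > 1L && any(idx[-1L] == idx[-n])  # any element equal to its predecessor?
}
has_dup_sorted(c(1L, 1L, 2L))
# [1] TRUE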

@franknarf1 (Contributor)

Somewhat related: the same problem during a join #2022

Fwiw, I find myself using both joins and Michael's approach, since often

# from the OP
mDT = data.table(row = c(1,1,2), new = 3:5)
DT[mDT$row, a := mDT$new]

is more expedient than adding a row-number column...

# adding a column and doing a join
mDT = data.table(row = c(1,1,2), new = 3:5)
DT[, row := .I]
DT[mDT, on=.(row), a := i.new]

so my code is vulnerable in both idioms until/unless I add a dupe check, anyDuplicated(mDT, by = on_cols) or something.
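
Concretely, the guard might look something like this (a sketch; "row" stands in for whatever the join columns actually are):

mDT = data.table(row = c(1, 1, 2), new = 3:5)
if (anyDuplicated(mDT, by = "row"))
  stop("duplicate keys in mDT; the update would be ambiguous")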

@jangorecki (Member) commented May 5, 2018

If we want to add so many checks, we should also collect more attributes. We already keep info on whether an object is sorted; we could also store whether it contains any NA, whether it has duplicates, and its uniqueN. That way we can at least reduce the overhead of the extra checks.
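
As a hypothetical illustration (not an existing feature), such metadata could be cached on the vector with setattr(), in the same spirit as the existing "sorted" attribute:

library(data.table)
i = c(1L, 1L, 2L)
setattr(i, "anyDup", anyDuplicated(i) > 0L)  # compute once, reuse in later checks
attr(i, "anyDup")
# [1] TRUE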

@MichaelChirico (Member, Author)

Related: #2879. I agree with Pasha that the potential speed hit should be the primary consideration in deciding whether to implement this. At a minimum, we should make sure there's a quick blurb in ?data.table highlighting the existing behavior.
