Strange results from a non-equi join with multiple conditions #2275

Closed
franknarf1 opened this issue Jul 20, 2017 · 2 comments
Labels: bug, non-equi joins (rolling, overlapping, non-equi joins)
@franknarf1
Contributor

franknarf1 commented Jul 20, 2017

This was brought up on SO.

The goal is to find out if each row in DT1 has a match in DT2 in the sense of on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE):

set.seed(123)
library(data.table)
library(stringi)
# Sorry that it requires stringi; I couldn't find another way forward besides the OP's verbatim example.

n <- 100000

DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=TRUE))

DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=TRUE))

DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200,300))]
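To make the matching rule concrete, here is a minimal brute-force sketch on a tiny hand-built pair of tables. The tables `dt1` and `dt2` and their values are made up for illustration (they are not the OP's data), and the row-by-row loop is only a reference definition of "has a match", far too slow for the full example.

```r
library(data.table)

# Toy tables with hypothetical values (not the data generated above)
dt1 <- data.table(RANDOM_STRING = c("aa", "aa", "bb"),
                  DATE = as.Date(c("2016-03-16", "2016-05-19", "2016-03-16")))
dt2 <- data.table(RANDOM_STRING = "aa",
                  START_DATE  = as.Date("2015-09-07"),
                  EXPIRY_DATE = as.Date("2016-04-17"))

# For each row of dt1, check whether any row of dt2 has the same string
# and an interval [START_DATE, EXPIRY_DATE] that contains DATE.
expected <- vapply(seq_len(nrow(dt1)), function(i) {
  any(dt2$RANDOM_STRING == dt1$RANDOM_STRING[i] &
      dt2$START_DATE   <= dt1$DATE[i] &
      dt1$DATE[i]      <= dt2$EXPIRY_DATE)
}, logical(1))

expected
# [1]  TRUE FALSE FALSE
```

Only the first row falls inside the interval; the second is past the expiry date and the third has no row in `dt2` with the same string.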

My usual approach is to do a join, counting matches with .N and by=.EACHI. However, the OP found that this fails here:

# correct result (takes a long time)

    DT1[, m_ok := DT2[.BY, on=.(RANDOM_STRING), inrange(DATE, START_DATE, EXPIRY_DATE)], by=RANDOM_STRING]

# my usual approach

    DT1[, MATCHED := FALSE]

    DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]

# comparison

    DT1[MATCHED != m_ok | (MATCHED & is.na(m_ok))] 

    # shows many cases of failure from the attempt
    # for example...

    DT1[RANDOM_STRING == "egkja"]; DT2[RANDOM_STRING == "egkja"]

    #    RANDOM_STRING       DATE MATCHED  m_ok
    # 1:         egkja 2016-05-19    TRUE FALSE
    # 2:         egkja 2016-06-02   FALSE FALSE
    # 3:         egkja 2016-05-20    TRUE FALSE
    # 4:         egkja 2016-03-16   FALSE  TRUE

    #    RANDOM_STRING START_DATE EXPIRY_DATE
    # 1:         egkja 2015-09-07  2016-04-17

There is probably a faster way to get the correct result (foverlaps?), but my point is that I expect the .N, by=.EACHI]$N > 0L approach to work. Is it failing because of a bug, or am I mistaken in using it here?

I had trouble making a smaller example. Reduce n by a factor of 10 and the problem disappears. Stranger still, the OP noticed that repeatedly running the DT1[!(MATCHED), MATCHED := ... ] line keeps making changes over many iterations. Also, the OP said they couldn't construct a failing example when the on= condition contained only one inequality.
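For reference, this is what the counting idiom is meant to do on a tiny hand-built input. The sketch does not reproduce the bug; the toy tables `dt1` and `dt2` are hypothetical, and it assumes a data.table version with non-equi join support.

```r
library(data.table)

# Toy tables with hypothetical values (not the data generated above)
dt1 <- data.table(RANDOM_STRING = c("aa", "aa", "bb"),
                  DATE = as.Date(c("2016-03-16", "2016-05-19", "2016-03-16")))
dt2 <- data.table(RANDOM_STRING = "aa",
                  START_DATE  = as.Date("2015-09-07"),
                  EXPIRY_DATE = as.Date("2016-04-17"))

# One output row per row of dt1 (by = .EACHI); N counts the dt2 rows that
# satisfy all three on= conditions for that dt1 row.
counts <- dt2[dt1,
              on = .(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE),
              .N, by = .EACHI]

# A dt1 row is matched when at least one dt2 row was counted for it.
dt1[, MATCHED := counts$N > 0L]
dt1
```

By the matching rule, only the first row of `dt1` should end up with MATCHED equal to TRUE, since the second date is past the expiry date and "bb" does not occur in `dt2`.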

EDIT: one faster way of coming up with the correct result, thanks to SO OP:

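DT1[, DT1_ID := .I]  # row id; not defined earlier in this example, assumed to be the DT1 row number
w = DT1[DT2, on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE), which=TRUE, nomatch=0]
DT1[, m := DT1_ID %in% w ]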
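A minimal sketch of the same which=TRUE idea on toy data may help show what `w` contains. The tables `dt1` and `dt2` are hypothetical, and `.I` is used here in place of a separate row-id column, which is an assumption about what that column held.

```r
library(data.table)

# Toy tables with hypothetical values (not the data generated above)
dt1 <- data.table(RANDOM_STRING = c("aa", "aa", "bb"),
                  DATE = as.Date(c("2016-03-16", "2016-05-19", "2016-03-16")))
dt2 <- data.table(RANDOM_STRING = "aa",
                  START_DATE  = as.Date("2015-09-07"),
                  EXPIRY_DATE = as.Date("2016-04-17"))

# which = TRUE returns row numbers of dt1 (the table being subset) that
# match some row of dt2; nomatch = 0 drops dt2 rows without any match.
w <- dt1[dt2,
         on = .(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE),
         which = TRUE, nomatch = 0]

# Flag dt1 rows whose row number appears among the matches.
dt1[, m := .I %in% w]
dt1
```

Because the join runs from the `dt2` side, a `dt1` row can appear in `w` more than once when several intervals cover it; `%in%` collapses that to a single TRUE.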
arunsrinivasan added the bug and non-equi joins (rolling, overlapping, non-equi joins) labels on Nov 7, 2017
arunsrinivasan added this to the v1.10.6 milestone on Nov 7, 2017

@arunsrinivasan
Member

Fixing #2360 also takes care of this. Please write back if not. Just issued the PR. Should be merged shortly, assuming tests pass.

TODO: update the SO post linked by Frank.

@mattdowle
Member

Closed by Arun's PR #2461 (the "closes #2275" needs to appear in the PR's first comment).
