Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow cross join in [.data.table #1717

Open
jangorecki opened this issue May 20, 2016 · 17 comments · May be fixed by #4544
Open

allow cross join in [.data.table #1717

jangorecki opened this issue May 20, 2016 · 17 comments · May be fixed by #4544
Assignees
Labels
feature request top request One of our most-requested issues

Comments

@jangorecki
Copy link
Member

jangorecki commented May 20, 2016

There is not straight way to do cross join currently. User need to add a column with a constant value in both datasets and join on that column. This could be nicely addressed in a similar way as it is in SQL where you can use SELECT ... FROM t1 JOIN t2 ON 1=1, where 1=1 is used to evaluate to TRUE for every row.
Eventually allow.cartesian could be set to TRUE when on=TRUE detected so the use case would look like:

X[Y, on=TRUE]

And corresponding SQLite

library(RSQLite)
library(data.table)
X=data.table(a=1:2)
Y=data.table(b=letters[1:2])
conn=dbConnect(SQLite())
dbWriteTable(conn, "t1", X)
dbWriteTable(conn, "t2", Y)
dbGetQuery(conn, "SELECT * FROM t1 JOIN t2 ON 1=1;")
#  a b
#1 1 a
#2 1 b
#3 2 a
#4 2 b

Update https://stackoverflow.com/questions/25888706/r-data-table-cross-join-not-working when solved.

@SymbolixAU

This comment has been minimized.

@jangorecki

This comment has been minimized.

@MichaelChirico

This comment has been minimized.

@jangorecki

This comment has been minimized.

@chinsoon12

This comment has been minimized.

@MichaelChirico

This comment has been minimized.

@MichaelChirico

This comment has been minimized.

@ToeKneeFan
Copy link

@MichaelChirico That's a nice, intuitive solution. It's worth cautioning, though, that it will result in duplicate column names if any are shared between the data.tables. Based on some (very nonexhaustive, probably memory-dependent) benchmarking, your solution (CJ1) is actually slightly faster than jangorecki's solution in the StackOverflow response (CJ2, modified to be more comparable) but has similar performance if the temporary joining column is assigned with :=. Of course, this is without manually taking into account keying and indexing behavior.

library(data.table)
DT1 <- data.table(V1 = rnorm(8E3))
DT2 <- data.table(V1 = rnorm(8E3))

# joining by grouping
CJ1 <- function(DT1, DT2){
	DT1[ , c(.SD, DT2), by = seq_len(nrow(DT1))]
}

# joining by allow.cartesian with temporary column
CJ2 <- function(DT1, DT2){
	DT1[, c(.SD, .tempCol = 1)][DT2[, c(.SD, .tempCol = 1)], on = ".tempCol", , allow.cartesian = T]
}

# joining by allow.cartesian with assigned column
CJ3 <- function(DT1, DT2){
	DT1[, .tempCol := 1]
	DT2[, .tempCol := 1]
	DT1[DT2, on = ".tempCol", , allow.cartesian = T]
}

join1 <- CJ1(DT1, DT2)
print(join1)
#           seq_len         V1          V1
#        1:       1 -0.6752308 -1.19897585
#        2:       1 -0.6752308 -2.14232417
#        3:       1 -0.6752308 -0.01960594
#        4:       1 -0.6752308  0.92697292
#        5:       1 -0.6752308  1.26429973
#       ---                               
# 63999996:    8000  1.1786346 -0.44214904
# 63999997:    8000  1.1786346  0.52729396
# 63999998:    8000  1.1786346  0.60022166
# 63999999:    8000  1.1786346  0.30667155
# 64000000:    8000  1.1786346 -0.77470274

join2 <- CJ2(DT1, DT2)
print(join2)
#                    V1 .tempCol       i.V1
#        1: -0.67523085        1 -1.1989758
#        2: -0.92701043        1 -1.1989758
#        3: -0.99014335        1 -1.1989758
#        4:  0.53156215        1 -1.1989758
#        5: -0.15562568        1 -1.1989758
#       ---                                
# 63999996:  0.39169500        1 -0.7747027
# 63999997:  1.94071323        1 -0.7747027
# 63999998: -0.08114057        1 -0.7747027
# 63999999: -0.31676249        1 -0.7747027
# 64000000:  1.17863463        1 -0.7747027

join3 <- CJ3(DT1, DT2)
print(join3)
#                    V1 .tempCol       i.V1
#        1: -0.67523085        1 -1.1989758
#        2: -0.92701043        1 -1.1989758
#        3: -0.99014335        1 -1.1989758
#        4:  0.53156215        1 -1.1989758
#        5: -0.15562568        1 -1.1989758
#       ---                                
# 63999996:  0.39169500        1 -0.7747027
# 63999997:  1.94071323        1 -0.7747027
# 63999998: -0.08114057        1 -0.7747027
# 63999999: -0.31676249        1 -0.7747027
# 64000000:  1.17863463        1 -0.7747027

# check results are the same
setnames(join1, which(names(join1) == "V1"), c("V1", "i.V1"))
join1 <- join1[, "seq_len" := NULL]
join2 <- join2[, ".tempCol" := NULL]
join3 <- join3[, ".tempCol" := NULL]
setorderv(join1, c("V1", "i.V1"))
setorderv(join2, c("V1", "i.V1"))
setorderv(join3, c("V1", "i.V1"))
identical(join1, join2)
# [1] TRUE
identical(join2, join3)
# [1] TRUE

# benchmark
microbenchmark::microbenchmark(
	CJ1(DT1, DT2),
	CJ2(DT1, DT2),
	CJ3(DT1, DT2),
	times = 5
)

# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval cld
#  CJ1(DT1, DT2) 2.486950 2.638036 2.767990 2.822850 2.842497 3.049616     5   a
#  CJ2(DT1, DT2) 1.743497 3.531119 3.751018 3.943120 4.090093 5.447262     5   a
#  CJ3(DT1, DT2) 1.848768 2.728135 2.737173 2.822515 3.094846 3.191602     5   a

@jangorecki jangorecki added this to the 1.13.0 milestone Jul 31, 2019
@chinsoon12
Copy link

Another option:

CJ.dt_2 <- function(...) {
    DTls <- list(...)
    rows <- do.call(CJ, lapply(DTls, function(x) x[, seq_len(.N)]))
    res <- DTls[[1L]][rows[[1L]]]
    for (n in seq_along(DTls)[-1L])
        res <- res[, c(.SD, DTls[[n]][rows[[n]]])]
    res
}

timing code:

library(data.table) #data.table_1.12.4 Win-4 R-3.6.1 x64 4 threads
nr <- 3e2
DT1 <- data.table(A1=1:nr, A1=1:nr)
DT2 <- data.table(B1=1:nr, B2=1:nr)
DT3 <- data.table(C1=1:nr, C2=1:nr)

#https:/Rdatatable/data.table/pull/814#issuecomment-55807497
CJ.dt = function(...) {
    rows = do.call(CJ, lapply(list(...), function(x) if(is.data.frame(x)) seq_len(nrow(x)) else seq_along(x)));
    do.call(data.table, Map(function(x, y) x[y], list(...), rows))
}

#https:/Rdatatable/data.table/issues/1717#issuecomment-355499952
CJ.dt_1 <- function(...) {
    Reduce(f=function(x, y) cbind(x[rep(1:nrow(x), times=nrow(y)),], y[rep(1:nrow(y), each=nrow(x)),]),
        x=list(...))
} #CJ.dft

CJ.dt_2 <- function(...) {
    DTls <- list(...)
    rows <- do.call(CJ, lapply(DTls, function(x) x[, seq_len(.N)]))
    res <- DTls[[1L]][rows[[1L]]]
    for (n in seq_along(DTls)[-1L])
        res <- res[, c(.SD, DTls[[n]][rows[[n]]])]
    res
}

#https:/Rdatatable/data.table/issues/2343#issuecomment-328156867
CJDT <- function(...)
    Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))

a1 <- CJ.dt(DT1, DT2, DT3)
setorderv(a1, names(a1))
a2 <- CJ.dt_1(DT1, DT2, DT3)
setorderv(a2, names(a2))
a3 <- CJ.dt_2(DT1, DT2, DT3)
setorderv(a3, names(a3))
a4 <- CJDT(DT1, DT2, DT3)
setorderv(a4, names(a4))
identical(a1, a2)
identical(a1, a3)
identical(a1, a4)

bench::mark(CJ.dt(DT1, DT2, DT3),
    CJ.dt_1(DT1, DT2, DT3),
    CJ.dt_2(DT1, DT2, DT3),
    CJDT(DT1, DT2, DT3),
    check=FALSE)

timings:


# A tibble: 4 x 13
  expression                  min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                    memory                time     gc              
  <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                    <list>                <list>   <list>          
1 CJ.dt(DT1, DT2, DT3)      1.52m    1.52m    0.0109    1.61GB   0.0328     1     3      1.52m <df[,6] [27,000,000 x 6]> <df[,3] [341 x 3]>    <bch:tm> <tibble [1 x 3]>
2 CJ.dt_1(DT1, DT2, DT3)    1.53s    1.53s    0.653     1.41GB   1.31       1     2      1.53s <df[,6] [27,000,000 x 6]> <df[,3] [1,467 x 3]>  <bch:tm> <tibble [1 x 3]>
3 CJ.dt_2(DT1, DT2, DT3)    1.27s    1.27s    0.787     2.35GB   0.787      1     1      1.27s <df[,6] [27,000,000 x 6]> <df[,3] [23,236 x 3]> <bch:tm> <tibble [1 x 3]>
4 CJDT(DT1, DT2, DT3)       1.07s    1.07s    0.930   929.44MB   0.930      1     1      1.07s <df[,6] [27,000,000 x 6]> <df[,3] [28 x 3]>     <bch:tm> <tibble [1 x 3]>

Another test:

DT_A <- data.table(A=1:8e3)
DT_B <- data.table(B=1:8e3)
bench::mark(CJ.dt(DT_A, DT_B), 
    CJ.dt_1(DT_A, DT_B), CJ.dt_2(DT_A, DT_B), CJDT(DT_A, DT_B), check=FALSE)
# A tibble: 4 x 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                    memory             time     gc              
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                    <list>             <list>   <list>          
1 CJ.dt(DT_A, DT_B)      1.32m    1.32m    0.0126    1.53GB   0.0126     1     1      1.32m <df[,2] [64,000,000 x 2]> <df[,3] [217 x 3]> <bch:tm> <tibble [1 x 3]>
2 CJ.dt_1(DT_A, DT_B)    2.77s    2.77s    0.362     1.43GB   0.362      1     1      2.77s <df[,2] [64,000,000 x 2]> <df[,3] [19 x 3]>  <bch:tm> <tibble [1 x 3]>
3 CJ.dt_2(DT_A, DT_B) 919.28ms 919.28ms    1.09      1.67GB   0          1     0   919.28ms <df[,2] [64,000,000 x 2]> <df[,3] [38 x 3]>  <bch:tm> <tibble [1 x 3]>
4 CJDT(DT_A, DT_B)       2.24s    2.24s    0.446   976.63MB   0.446      1     1      2.24s <df[,2] [64,000,000 x 2]> <df[,3] [19 x 3]>  <bch:tm> <tibble [1 x 3]>

@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@jangorecki
Copy link
Member Author

jangorecki commented May 10, 2020

This is a low-overhead version built on top of #4370, not really tested

crossjoin = function(x, i) {
  stopifnot(is.data.table(x), is.data.table(i))
  ni = nrow(i)
  nx = nrow(x)
  ## bmerge ans for cross join
  ans = list(starts = rep.int(1L, ni), lens = rep.int(nx, ni))
  ## dtmerge ans for cross join
  ans = list(xrows = vecseq(ans$starts, ans$lens, NULL), irows = seqexp(ans$lens))
  out.i = .Call(CsubsetDT, i, ans$irows, colnamesInt(i, NULL))
  out.x = .Call(CsubsetDT, x, ans$xrows, colnamesInt(x, NULL))
  out = .Call(Ccbindlist, list(out.i, out.x), FALSE)
  setDT(out)
  out
}

@jangorecki
Copy link
Member Author

@chinsoon12 benchmark you presented suffers badly in terms of readability

  expression                  min
  <bch:expr>             <bch:tm>       
1 CJ.dt(DT1, DT2, DT3)      1.52m
2 CJ.dt_1(DT1, DT2, DT3)    1.53s
3 CJ.dt_2(DT1, DT2, DT3)    1.27s
4 CJDT(DT1, DT2, DT3)       1.07s

how many people will notice that first expression has timing in minutes while all remaning ones in seconds?

@jangorecki
Copy link
Member Author

jangorecki commented May 10, 2020

Are there practical uses cases for a cross join on more than 2 tables?
Above crossjoin could be tweaked to apply Reduce when producing irows and xrows, then call CsubsetDT just single time for i and x, and not for every pair of tables.


I re-run benchmark

library(data.table)
nr <- 3e2
DT1 <- data.table(A1=1:nr, A1=1:nr)
DT2 <- data.table(B1=1:nr, B2=1:nr)
DT3 <- data.table(C1=1:nr, C2=1:nr)

#https:/Rdatatable/data.table/pull/814#issuecomment-55807497
CJ.dt = function(...) {
    rows = do.call(CJ, lapply(list(...), function(x) if(is.data.frame(x)) seq_len(nrow(x)) else seq_along(x)));
    do.call(data.table, Map(function(x, y) x[y], list(...), rows))
}

#https:/Rdatatable/data.table/issues/1717#issuecomment-355499952
CJ.dt_1 <- function(...) {
    Reduce(f=function(x, y) cbind(x[rep(1:nrow(x), times=nrow(y)),], y[rep(1:nrow(y), each=nrow(x)),]),
        x=list(...))
} #CJ.dft

CJ.dt_2 <- function(...) {
    DTls <- list(...)
    rows <- do.call(CJ, lapply(DTls, function(x) x[, seq_len(.N)]))
    res <- DTls[[1L]][rows[[1L]]]
    for (n in seq_along(DTls)[-1L])
        res <- res[, c(.SD, DTls[[n]][rows[[n]]])]
    res
}

#https:/Rdatatable/data.table/issues/2343#issuecomment-328156867
CJDT <- function(...)
    Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))

a1 <- CJ.dt(DT1, DT2, DT3)
setorderv(a1, names(a1))
a2 <- CJ.dt_1(DT1, DT2, DT3)
setorderv(a2, names(a2))
a3 <- CJ.dt_2(DT1, DT2, DT3)
setorderv(a3, names(a3))
a4 <- CJDT(DT1, DT2, DT3)
setorderv(a4, names(a4))
identical(a1, a2)
identical(a1, a3)
identical(a1, a4)

crossjoin = data.table:::crossjoin
a5 = crossjoin(crossjoin(DT1, DT2), DT3)
setcolorder(a5, c(5:6,3:4,1:2))
setorderv(a5, names(a5))
identical(a1, a5)

DT_A <- data.table(A=1:8e3)
DT_B <- data.table(B=1:8e3)

mark1 = function(..., check=NULL) {
  l = as.list(substitute(list(...)))[-1L]
  txt = vapply(l, deparse, "")
  env = parent.frame()
  sec = vapply(l, function(expr) system.time(eval(expr, env=env))[[3L]], 0.0)
  data.frame(expr = txt, sec = sec)
}
cat("# run 1\n")
mark1(CJ.dt(DT_A, DT_B), CJ.dt_1(DT_A, DT_B), CJ.dt_2(DT_A, DT_B), CJDT(DT_A, DT_B), crossjoin(DT_A, DT_B), check=FALSE)
cat("\n# run 2\n")
mark1(CJ.dt(DT_A, DT_B), CJ.dt_1(DT_A, DT_B), CJ.dt_2(DT_A, DT_B), CJDT(DT_A, DT_B), crossjoin(DT_A, DT_B), check=FALSE)

timings

# run 1
                   expr    sec
1     CJ.dt(DT_A, DT_B) 65.542
2   CJ.dt_1(DT_A, DT_B)  2.668
3   CJ.dt_2(DT_A, DT_B)  0.743
4      CJDT(DT_A, DT_B)  2.121
5 crossjoin(DT_A, DT_B)  0.459

# run 2
                   expr    sec
1     CJ.dt(DT_A, DT_B) 65.424
2   CJ.dt_1(DT_A, DT_B)  2.664
3   CJ.dt_2(DT_A, DT_B)  0.723
4      CJDT(DT_A, DT_B)  2.121
5 crossjoin(DT_A, DT_B)  0.458

@chinsoon12

This comment has been minimized.

@chinsoon12

This comment has been minimized.

@jangorecki

This comment has been minimized.

@Kamgang-B
Copy link
Contributor

I have just missed this feature.

My opinion on this:

1- I think that this feature nicely fits into CJ function in the sense that CJ could be expanded to support data.frame (or list) too. I find something like CJ(DT1, DT2, etc.) very nice and natural. I don't really see the benefit of implementing this feature in a new function.

2- cross join could also be supported in [.data.table function. X[Y, on=TRUE] is nice but I suggest to also consider using the keyword .CROSSJOIN in on=, like X[Y, on=.CROSSJOIN]. The latter one is likely more intuitive (specially for users not used to SQL) and I feel that it would be somehow consistent with the way the natural join looks like in data.table (X[Y, on=.NATURAL])

An important remark: The first point (implementation in CJ allows to cross join more than two datasets at once while the second point is limited to only two.

@jangorecki
Copy link
Member Author

You are welcome to review cross join functionality proposed in #4370

@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request top request One of our most-requested issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants