allow cross join in [.data.table
#1717
Comments
@MichaelChirico That's a nice, intuitive solution. It's worth cautioning, though, that it will result in duplicate column names if any are shared between the data.tables. Based on some (very nonexhaustive, probably memory-dependent) benchmarking, your solution (
Another option:
timing code:
timings:
Another test:
This is a low-overhead version built on top of #4370, not really tested:
crossjoin = function(x, i) {
stopifnot(is.data.table(x), is.data.table(i))
ni = nrow(i)
nx = nrow(x)
## bmerge ans for cross join
ans = list(starts = rep.int(1L, ni), lens = rep.int(nx, ni))
## dtmerge ans for cross join
ans = list(xrows = vecseq(ans$starts, ans$lens, NULL), irows = seqexp(ans$lens))
out.i = .Call(CsubsetDT, i, ans$irows, colnamesInt(i, NULL))
out.x = .Call(CsubsetDT, x, ans$xrows, colnamesInt(x, NULL))
out = .Call(Ccbindlist, list(out.i, out.x), FALSE)
setDT(out)
out
}
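For readers without the #4370 branch, the row indices that the function above builds can be sketched in plain R. This is only an illustration: it assumes the internal helpers `vecseq` and `seqexp` expand to the two `rep()` calls shown here, and `cross_rows` is a hypothetical name, not part of data.table.

```r
library(data.table)

# Plain-R sketch of the row indices a cross join needs.
# Assumption: this mirrors what the internal helpers produce; not verified
# against the #4370 implementation.
cross_rows <- function(nx, ni) {
  list(
    xrows = rep(seq_len(nx), times = ni),  # 1..nx, repeated once per row of i
    irows = rep(seq_len(ni), each = nx)    # each row of i repeated nx times
  )
}

x <- data.table(A = 1:2)
i <- data.table(B = c("u", "v", "w"))

idx <- cross_rows(nrow(x), nrow(i))

# Columns of i first, then columns of x, as in the function above.
out <- cbind(i[idx$irows], x[idx$xrows])
```

The result has `nrow(x) * nrow(i)` rows. Subsetting through `[.data.table` as done here carries more overhead than the direct `CsubsetDT` calls, which is presumably what "low-overhead" refers to.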
@chinsoon12 the benchmark you presented suffers badly in terms of readability.
How many people will notice that the first expression has its timing in minutes while all the remaining ones are in seconds?
Are there practical use cases for a cross join on more than 2 tables? I re-ran the benchmark:
library(data.table)
nr <- 3e2
DT1 <- data.table(A1=1:nr, A2=1:nr)
DT2 <- data.table(B1=1:nr, B2=1:nr)
DT3 <- data.table(C1=1:nr, C2=1:nr)
#https://github.com/Rdatatable/data.table/pull/814#issuecomment-55807497
CJ.dt = function(...) {
rows = do.call(CJ, lapply(list(...), function(x) if(is.data.frame(x)) seq_len(nrow(x)) else seq_along(x)));
do.call(data.table, Map(function(x, y) x[y], list(...), rows))
}
#https://github.com/Rdatatable/data.table/issues/1717#issuecomment-355499952
CJ.dt_1 <- function(...) {
Reduce(f=function(x, y) cbind(x[rep(1:nrow(x), times=nrow(y)),], y[rep(1:nrow(y), each=nrow(x)),]),
x=list(...))
} #CJ.dft
CJ.dt_2 <- function(...) {
DTls <- list(...)
rows <- do.call(CJ, lapply(DTls, function(x) x[, seq_len(.N)]))
res <- DTls[[1L]][rows[[1L]]]
for (n in seq_along(DTls)[-1L])
res <- res[, c(.SD, DTls[[n]][rows[[n]]])]
res
}
#https://github.com/Rdatatable/data.table/issues/2343#issuecomment-328156867
CJDT <- function(...)
Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))
a1 <- CJ.dt(DT1, DT2, DT3)
setorderv(a1, names(a1))
a2 <- CJ.dt_1(DT1, DT2, DT3)
setorderv(a2, names(a2))
a3 <- CJ.dt_2(DT1, DT2, DT3)
setorderv(a3, names(a3))
a4 <- CJDT(DT1, DT2, DT3)
setorderv(a4, names(a4))
identical(a1, a2)
identical(a1, a3)
identical(a1, a4)
crossjoin = data.table:::crossjoin
a5 = crossjoin(crossjoin(DT1, DT2), DT3)
setcolorder(a5, c(5:6,3:4,1:2))
setorderv(a5, names(a5))
identical(a1, a5)
DT_A <- data.table(A=1:8e3)
DT_B <- data.table(B=1:8e3)
mark1 = function(..., check=NULL) {
l = as.list(substitute(list(...)))[-1L]
txt = vapply(l, deparse, "")
env = parent.frame()
sec = vapply(l, function(expr) system.time(eval(expr, env=env))[[3L]], 0.0)
data.frame(expr = txt, sec = sec)
}
cat("# run 1\n")
mark1(CJ.dt(DT_A, DT_B), CJ.dt_1(DT_A, DT_B), CJ.dt_2(DT_A, DT_B), CJDT(DT_A, DT_B), crossjoin(DT_A, DT_B), check=FALSE)
cat("\n# run 2\n")
mark1(CJ.dt(DT_A, DT_B), CJ.dt_1(DT_A, DT_B), CJ.dt_2(DT_A, DT_B), CJDT(DT_A, DT_B), crossjoin(DT_A, DT_B), check=FALSE)
timings
I have just missed this feature. My opinion on this:
1- I think that this feature nicely fits into
2- cross join could also be supported in
An important remark: The first point (implementation in
You are welcome to review the cross join functionality proposed in #4370.
There is currently no straightforward way to do a cross join. The user needs to add a column with a constant value to both datasets and join on that column. This could be nicely addressed in a similar way to SQL, where you can use SELECT ... FROM t1 JOIN t2 ON 1=1, where 1=1 evaluates to TRUE for every row. Eventually allow.cartesian could be set to TRUE when on=TRUE is detected, so the use case would look like:
And the corresponding SQLite:
Update https://stackoverflow.com/questions/25888706/r-data-table-cross-join-not-working when solved.
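The constant-column workaround described above can be sketched as follows. This is a minimal illustration, not the proposed `on=TRUE` syntax: the tables `t1` and `t2` hold made-up data, and the helper column name `k` is an arbitrary choice for this example.

```r
library(data.table)

t1 <- data.table(a = 1:3)
t2 <- data.table(b = c("x", "y"))

# Workaround: add a helper column holding the same constant to both tables
# (the name `k` is hypothetical), then join on it.
t1[, k := 1L]
t2[, k := 1L]

# Every row of t1 matches every row of t2 on `k`; allow.cartesian = TRUE
# permits the result to grow to nrow(t1) * nrow(t2) rows.
res <- t1[t2, on = "k", allow.cartesian = TRUE]

# Drop the helper column from the result and from the inputs.
res[, k := NULL]
t1[, k := NULL]
t2[, k := NULL]
```

Because `:=` modifies the inputs by reference, the helper column has to be removed again afterwards, which is part of why a built-in cross join would be more convenient.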