FR: if CJ is passed data.table arguments, do a blocked cross-join #2343

Closed
MichaelChirico opened this issue Sep 8, 2017 · 3 comments

MichaelChirico (Member) commented Sep 8, 2017

I have a case where I don't want the outer product of all rows of some input, but rather the outer product of all blocks of rows of the input. It seems natural that CJ should be able to handle constructing this:

DT1 = data.table(x1 = c(1, 2), x2 = c(3, 4))
DT2 = data.table(y1 = c(5, 6, 7))

Desired output:

CJ(DT1, DT2)
#    x1 x2 y1
# 1:  1  3  5
# 2:  2  4  5
# 3:  1  3  6
# 4:  2  4  6
# 5:  1  3  7
# 6:  2  4  7

Hopefully it's sufficiently clear from this.

A hack is to do something like:

idxDT = CJ(seq_len(nrow(DT1)), seq_len(nrow(DT2)))
idxDT[ , cbind(DT1[V1], DT2[V2])]
#    x1 x2 y1
# 1:  1  3  5
# 2:  1  3  6
# 3:  1  3  7
# 4:  2  4  5
# 5:  2  4  6
# 6:  2  4  7

The order isn't particularly natural here, but doesn't matter in my application. Worse is that it's clunky and not easily extensible to having several more data.tables of input.

The most natural call with current functionality, CJ(DT1$x1, DT1$x2, DT2$y1), is wrong: it also crosses x1 with x2, so it returns 12 rows (2 x 2 x 3) instead of 6 and must be pared back.
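To make the row-count problem concrete, here is a minimal sketch using the tables defined above. The names `full` and `pared` are just illustrative, and the join back onto DT1 is one possible way to pare the result down, not necessarily the intended one.

```r
library(data.table)

DT1 = data.table(x1 = c(1, 2), x2 = c(3, 4))
DT2 = data.table(y1 = c(5, 6, 7))

# CJ treats every argument as an independent vector, so x1 and x2 get crossed too
full = CJ(x1 = DT1$x1, x2 = DT1$x2, y1 = DT2$y1)
nrow(full)   # 2 * 2 * 3 = 12

# Keep only the (x1, x2) combinations that are actual rows of DT1
pared = full[DT1, on = c("x1", "x2")]
nrow(pared)  # 2 * 3 = 6
```

This works but builds the oversized table first, which is exactly the cost a blocked cross-join would avoid.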

st-pasha (Contributor) commented Sep 8, 2017

How about this:

DT2[, (DT1), by=y1]
#    y1 x1 x2
# 1:  5  1  3
# 2:  5  2  4
# 3:  6  1  3
# 4:  6  2  4
# 5:  7  1  3
# 6:  7  2  4
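A small sketch of how this could be tidied to match the column order in the original request, plus one caveat worth keeping in mind: `by` groups on distinct values, so it only reproduces a full cross-join when the grouping column has no repeats. `res` and `DT3` are names introduced here purely for illustration.

```r
library(data.table)

DT1 = data.table(x1 = c(1, 2), x2 = c(3, 4))
DT2 = data.table(y1 = c(5, 6, 7))

# Same grouping trick as above, then move the columns into the requested order
res = DT2[, (DT1), by = y1]
setcolorder(res, c("x1", "x2", "y1"))

# Caveat: repeated values in the grouping column collapse into one group
DT3 = data.table(y1 = c(5, 5, 6))
nrow(DT3[, (DT1), by = y1])  # 4 rows (2 groups * 2), not 3 * 2 = 6
```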

franknarf1 (Contributor) commented Sep 8, 2017

> I have a case where I don't want the outer product of all rows of some input, but rather the outer product of all blocks of rows of the input.

I don't see the distinction here. There are two rows in one table; three in the other; and the Cartesian product in the result (regarding rows as tuples and tables as sets of tuples), unless you just mean the row order.

> Worse is that it's clunky and not easily extensible to having several more data.tables of input.

With Reduce...

CJDT = function(...) 
  Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))

Not sure if that's what you're after. I'm not crazy about the row ordering, so I'd probably do cbind(DT1[rep(1:.N, each=nrow(DT2))], DT2) instead, fwiw.
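For comparison, a variant that does not rely on cbind recycling the shorter table: build the row indices for every input with CJ, then subset each table explicitly. This is only a sketch, assuming ordinary data.tables with distinct column names; `cross_join_tables` is a made-up name and not part of data.table.

```r
library(data.table)

# Hypothetical helper: blocked cross-join of any number of data.tables
cross_join_tables = function(...) {
  tables = list(...)
  # One index vector per table; CJ yields every combination of row numbers,
  # with the first table's index varying slowest
  idx = do.call(CJ, lapply(tables, function(d) seq_len(nrow(d))))
  # Subset each table by its own index column, then bind the pieces side by side
  parts = Map(function(d, i) d[i], tables, unname(as.list(idx)))
  do.call(cbind, parts)
}

DT1 = data.table(x1 = c(1, 2), x2 = c(3, 4))
DT2 = data.table(y1 = c(5, 6, 7))
res = cross_join_tables(DT1, DT2)
```

Every piece already has the full number of rows before binding, so nothing is recycled; the trade-off is materialising the index table first.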

Btw, I guess this is related to Jan's CJ.dt #1717

MichaelChirico (Member, Author) commented

Indeed I think #1717 covers this. Closing.
