join vignette #2181

Open
lilchow opened this issue May 27, 2017 · 46 comments
Labels
documentation joins non-equi joins rolling, overlapping, non-equi joins top request One of our most-requested issues
@lilchow

lilchow commented May 27, 2017

Several places in the available vignettes refer to this mysterious vignette about join and rolling join. When will it be up? Thanks

@franknarf1
Contributor

See #944 . There are many ideas for vignettes there.

If you want some examples, you could see stackoverflow.com or maybe my notes.

@MichaelChirico
Member

Noting for reference: joins vignette would be a good place to have an example of replacing nested ifelse with a join.
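
A minimal sketch of what such an example might look like, assuming a hypothetical table `DT` with a `code` column and a small `lookup` table (neither comes from this thread). The nested `ifelse` is replaced by an update join that copies `label` from `lookup` into `DT` by reference:

```r
library(data.table)

# Hypothetical data: each row carries a short code that needs a label.
DT <- data.table(id = 1:5, code = c("a", "b", "c", "a", "d"))

# Nested ifelse version: every new code means another nesting level.
DT[, label_ifelse := ifelse(code == "a", "alpha",
                     ifelse(code == "b", "beta",
                     ifelse(code == "c", "gamma", NA_character_)))]

# Join version: the mapping lives in a lookup table.
lookup <- data.table(code  = c("a", "b", "c"),
                     label = c("alpha", "beta", "gamma"))

# Update join: for rows of DT matching lookup on `code`, assign i.label.
# Rows of DT without a match (code "d") are left as NA.
DT[lookup, on = "code", label_join := i.label]
```

Adding a new code then only means adding a row to `lookup`, rather than editing the expression.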

@jangorecki
Member

jangorecki commented Apr 10, 2020

summarizing the scope

@jangorecki jangorecki mentioned this issue Apr 24, 2020
33 tasks
@zeomal

zeomal commented Apr 24, 2020

@jangorecki, given that #3453 is in preparation and @Henrik-P covers rolling joins in detail there, would it make sense to add separate vignettes for equi-joins and non-equi joins? I believe the latter are far more relevant for time series analysis. Given your scope above, both vignettes would already have significant content.

@jangorecki
Member

For Joins vignette:

#2396

Originally posted by @MichaelChirico in #944 (comment)

@jangorecki
Member

jangorecki commented Apr 24, 2020

@zeomal better to have 2 bigger vignettes, than 3 smaller IMO. We already have many vignettes.

@jangorecki jangorecki changed the title from "Join and rolling join vignette is MIA" to "join vignette" Apr 24, 2020
@zeomal

zeomal commented Apr 25, 2020

@jangorecki, I've created a draft pull request for this vignette. It's a first version, bound to have many changes, but covers the basics of equi-joins. This is my first pull request ever, so if I've done something wrong, please correct me.

@jangorecki
Member

jangorecki commented Jul 20, 2022

By "documentation" @avimallu meant the package manual. Aside from what @avimallu mentioned, I would add Stack Overflow as well.
I am pretty sure a join vignette will be added at some point. We don't just want to document joins there (there is a manual for that already); we would like a nice guided story that goes through different join scenarios while staying easy to grasp and not overwhelming. Writing such a good vignette is not that simple.

@MichaelChirico
Member

Agree with Jan -- it's one thing to have simple snippets in Rd pages; constructing a coherent narrative with compelling (and publicly available!) data is a different beast. Contributions definitely welcome here.

If there's any piece you think is missing (edge cases, certain parameters/their interaction, etc) from the Rd docs, please flag that and it'll be easier to fix.

@JaimeArboleda

Hello!! I don't know if this is the most recent issue discussing this missing vignette. But I landed on data.table a couple of days ago and I wanted to learn the framework thoroughly, following the vignettes in order. I was sad to find out that this vignette does not exist yet.

By the way, I wanted to thank you all for this package. I come from Python (I am very used to pandas), and lately I started using R and I was enjoying the tidyverse approach (which is the one most commonly taught). But when I started with this library I was sooooooo blown away by it. I mean, it's just amazingly good: even if it were not the fastest library ever, I would be using it just for the clever syntax. I cannot emphasize enough how much I love it!!! So thanks a lot for this wonderful creation.

If I could be of any help with this vignette that would be great. Is there anyone working on it? Is there any branch with a draft of the vignette? I have no idea...

@jangorecki
Member

jangorecki commented Sep 27, 2022

Thank you for the warm words. My impression of DT was quite similar when I first came to it :) top speed and low memory use are just a nice bonus on top of the best syntax.

As for learning joins, you can go through the list of join features mentioned in this issue, and look them up in the ?data.table manual and on Stack Overflow. There was a draft of a join vignette, or maybe even two, but they were far from complete, so I doubt either of them will ultimately succeed as a vignette.

@JaimeArboleda

Yeah, but to be honest it took me some time to decide to invest in it, because my wrong impression, created by many shared opinions in blogs and discussion forums, was that the syntax was ugly and difficult to understand. And things like the mere existence of tidytable reinforce this idea (somewhat recognizing that data.table can be improved with tidy syntax).

I mean, I think it's good that both syntax approaches exist (especially being so orthogonal), and that different people can use R the way they prefer. The only thing that makes me sad is that I feel data.table is underpromoted and has an undeserved aura of obscurity. At least, that was my perception.

Thanks a lot for your suggestion. I will start with your approach and hopefully I will be able to understand it.

@tdhock
Member

tdhock commented Feb 28, 2023

+1 I found it confusing that this vignette is mentioned in datatable-intro but cannot be found/read. Is there another reference that we can use for teaching people how to do joins?

@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@AngelFelizR
Contributor

AngelFelizR commented Nov 7, 2023

@jangorecki, could I use the Taylor Swift TidyTuesday dataset to create the vignette?

I can explain what I learnt in the DataCamp course "Joining Data with data.table in R".

@waynelapierre

Cannot believe that this issue has been around for so long. This is actually a bug. It is better not to mention it at all.

@jangorecki
Member

jangorecki commented Nov 8, 2023

There are already drafts or works in progress of this vignette, IIRC two or even three, so it would probably be better to start from one of those rather than adding another.

@avimallu
Contributor

avimallu commented Nov 8, 2023

@AngelFelizR, take a look at #4398 for inspiration and which issues the join vignette could close.

@AngelFelizR
Contributor

Thanks @avimallu, I will work to have a first draft by 2023-11-27

@dvg-p4
Contributor

dvg-p4 commented Nov 9, 2023

I've found https://medium.com/analytics-vidhya/r-data-table-joins-48f00b46ce29 to be quite helpful as well. Though since that's on a personal blog you'll probably want to contact the author for permission if you wanted to copy from it for the vignette.

@AngelFelizR
Contributor

After reading all the comments related to this issue, I understand that the vignette should be built on simulated data. That lets it demonstrate the package in various situations using short data.tables of about 5 rows, and it avoids unnecessary dependencies. It’s important to keep the story from becoming overwhelming.

Here is the basic structure that I will be creating:

  1. merge function
  • Inner join
  • Right join
  • Replacing nested ifelse with a join
  • Left join
  • Full join
  • Different col names
  • Many to many join (allow.cartesian)
  2. data.table syntax joins
  • Right join
  • Keyed joins
  • Natural join
  • Update x on join
  • Aggregate on join (by = .EACHI or by = x's columns)
  • Editing x based on i matching columns by x groups ()
  • i. and x. j's prefixes
  • Inner join
  • Not join
  • Many to many join
    • allow.cartesian
    • mult
  • Non-equi join (>=, >, <=, <)
  • Rolling join
  • Semi join
  • Cross join
  3. Merging many tables (Reduce(merge, list(DT1,DT2,DT3,...)))
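
To make a few of the items above concrete, here is a rough sketch on simulated data. The tables `X`, `Y`, `Z` and their columns are hypothetical and only illustrate the syntax, not the planned vignette content:

```r
library(data.table)

X <- data.table(id = c(1L, 2L, 3L), x_val = c("a", "b", "c"))
Y <- data.table(id = c(2L, 3L, 4L), y_val = c(10, 20, 30))
Z <- data.table(id = c(1L, 4L),     z_val = c(TRUE, FALSE))

# Right join: X[Y, on = ] keeps every row of Y; unmatched x columns are NA.
right <- X[Y, on = "id"]

# Inner join: nomatch = NULL drops rows of Y that have no match in X.
inner <- X[Y, on = "id", nomatch = NULL]

# Not join: rows of X that have no match in Y.
anti <- X[!Y, on = "id"]

# Full join through merge(); all = TRUE keeps unmatched rows from both sides.
full <- merge(X, Y, by = "id", all = TRUE)

# Merging many tables: fold merge() over a list of tables.
all_three <- Reduce(function(a, b) merge(a, b, by = "id", all = TRUE),
                    list(X, Y, Z))
```

With tables this small, each result can be printed and checked by eye.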

Please let me know if I am missing something.

@tdhock
Member

tdhock commented Nov 28, 2023

I used the flights data to explain joins, in my slides for the data.table tutorial at the LatinR meeting last month, https:/tdhock/2023-10-LatinR-data.table#english

@jangorecki
Member

jangorecki commented Nov 28, 2023

The mergelist PR is ready to merge, so it will probably land in master before the vignette and should be included as well.

foverlaps is missing

@avimallu
Contributor

I think we should avoid the merge function other than as a side note. One of data.table's strengths is its merge syntax, and that is what should be front, right and centre.

In addition, the overlap join functions have a separate syntax, it might be worth placing all syntactically similar joins together to have them all in one place.

@AngelFelizR
Contributor

AngelFelizR commented Nov 28, 2023

@avimallu

> I think we should avoid the merge function other than as a side note. One of data.table's strengths is its merge syntax, and that is what should be front, right and centre.

I started the vignette with the merge function because it is easier for new users to understand. In my case it is normal to chain many merge calls using the following syntax, as that is the only way to apply a left join.

DtMerged <-
  DT1[...
  ][, merge(.SD, DT2, by = "x", all.x = TRUE)
  ][, merge(.SD, DT3, by = "y", all.x = TRUE)]

What I could do is to move the mergelist from point 3 to point 2 to avoid the switch from function to syntax.

> In addition, the overlap join functions have a separate syntax, it might be worth placing all syntactically similar joins together to have them all in one place.

I thought that overlap join is an application of non-equi join.

@jangorecki
Member

overlapping join (foverlaps api) is a special case of non-equi join ([.data.table api).
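
As a rough illustration of that relationship, the sketch below finds the same overlapping pairs twice, once with `foverlaps` and once with a non-equi join. The `intervals` and `events` tables are made up for this example:

```r
library(data.table)

intervals <- data.table(grp = c("A", "B"),
                        start = c(1L, 10L), end = c(5L, 15L))
events <- data.table(ev = c("e1", "e2", "e3"),
                     start = c(3L, 7L, 12L), end = c(4L, 8L, 20L))

# foverlaps() needs y (here: intervals) keyed, with the interval columns
# as the last two key columns.
setkey(intervals, start, end)
ov <- foverlaps(events, intervals, type = "any")
# Keep only events that overlapped some interval (unmatched rows have NA grp).
ov_pairs <- ov[!is.na(grp), .(grp, ev)]

# The same overlap condition written as a non-equi join:
# intervals.start <= events.end  AND  intervals.end >= events.start
ne_pairs <- intervals[events,
                      on = .(start <= end, end >= start),
                      nomatch = NULL,
                      .(grp, ev)]
```

Both approaches should report `e1` overlapping `A` and `e3` overlapping `B`, while `e2` overlaps nothing.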

I second @avimallu's suggestion about dropping merge. Mentioning it and linking to the manual sounds good, but it is sub-optimal in performance and in features, so we should try not to onboard users into it. mergelist is a good substitute.
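
For comparison, a sketch of how a chain of left joins could be written with the `[` syntax instead of `merge`. It relies on the fact that `Y[X, on = ]` keeps every row of `X`; the tables `DT1`, `DT2`, `DT3` below are hypothetical stand-ins with the same key columns `x` and `y` as in the snippet above:

```r
library(data.table)

DT1 <- data.table(x = c("a", "b", "c"), y = c(1L, 2L, 3L))
DT2 <- data.table(x = c("a", "b"), v2 = c(10, 20))
DT3 <- data.table(y = c(2L, 3L), v3 = c("p", "q"))

# merge() chain: left joins of DT1 with DT2, then with DT3.
via_merge <- merge(merge(DT1, DT2, by = "x", all.x = TRUE),
                   DT3, by = "y", all.x = TRUE)

# Bracket syntax: DT2[DT1, on = "x"] keeps all rows of DT1, i.e. it is a
# left join as seen from DT1. Nesting repeats the step for DT3.
via_bracket <- DT3[DT2[DT1, on = "x"], on = "y"]
```

The two results hold the same rows; only the column order and the key differ, since `merge` sorts and keys by the `by` columns.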

@AngelFelizR
Contributor

AngelFelizR commented Nov 28, 2023

Here is the new structure

  1. data.table syntax joins
  • Right join
  • Left join (custom function)
  • Keyed joins
  • Natural join
  • Update x on join
  • Aggregate on join (by = .EACHI or by = x's columns)
  • Editing x based on i matching columns by x groups ()
  • i. and x. j's prefixes
  • Inner join
  • Not join
  • Many to many join
    • allow.cartesian
    • mult
  • Non-equi join (>=, >, <=, <)
  • Rolling join
  • Semi join
  • Cross join
  2. Joining a list of tables (mergelist)
  3. Full join (by using merge and link to documentation)
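
As a sketch of two of the less familiar items in this outline, here is a non-equi join and a rolling join on simulated data. All table and column names (`bands`, `sales`, `quotes`, `trades`) are invented for illustration:

```r
library(data.table)

# Non-equi join: assign each sale to the band whose range contains it.
bands <- data.table(band  = c("low", "mid", "high"),
                    lower = c(0, 10, 20),
                    upper = c(10, 20, 30))
sales <- data.table(sale_id = 1:4, amount = c(5, 10, 25, 40))

# Match where bands$lower <= sales$amount and bands$upper > sales$amount.
# The x. and i. prefixes make explicit which table a column comes from.
banded <- bands[sales,
                on = .(lower <= amount, upper > amount),
                .(sale_id, amount = i.amount, band = x.band)]

# Rolling join: for each trade time, take the most recent quote at or
# before that time (last observation carried forward).
quotes <- data.table(time = c(1, 5, 9), price = c(100, 101, 102))
trades <- data.table(time = c(2, 5, 12))
rolled <- quotes[trades, on = "time", roll = TRUE]
```

The sale of 40 falls outside every band, so its `band` is `NA` under the default `nomatch`.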

@jangorecki
Member

nb. mergelist supports full join as well, probably much more efficient than merge

@AngelFelizR
Contributor

> nb. mergelist supports full join as well, probably much more efficient than merge

That sounds really good.

So this will be the vignette's structure:

  1. data.table syntax joins
  • Right join
  • Left join (custom function)
  • Keyed joins
  • Natural join
  • Update x on join
  • Aggregate on join (by = .EACHI or by = x's columns)
  • Editing x based on i matching columns by x groups ()
  • i. and x. j's prefixes
  • Inner join
  • Not join
  • Many to many join
    • allow.cartesian
    • mult
  • Non-equi join (>=, >, <=, <)
  • Rolling join
  • Semi join
  • Cross join
  2. Full join and joining a list of tables (mergelist)

Link to merge function documentation
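
For the semi join and cross join entries, one possible sketch is shown below. There is no dedicated semi-join argument used here; the idiom with `which = TRUE` is just one way to get the effect, and the tables are hypothetical:

```r
library(data.table)

X <- data.table(id = c(1L, 2L, 3L, 4L), x_val = c("a", "b", "c", "d"))
Y <- data.table(id = c(2L, 2L, 4L, 5L), y_val = c(10, 11, 12, 13))

# Semi join: rows of X with at least one match in Y, each returned once.
# which = TRUE returns matching row numbers of X instead of joined data.
hit  <- X[Y, on = "id", which = TRUE, nomatch = NULL]
semi <- X[sort(unique(hit))]

# Cross join: every combination of the supplied vectors.
cross <- CJ(id = X$id, grp = c("u", "v"))
```

Deduplicating `hit` matters because `id` 2 appears twice in `Y` and would otherwise repeat the row of `X`.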

@jangorecki
Member

Moreover, mergelist has more mult options as far as I recall.

Aggregate on join is not possible by x's columns yet, only with by = .EACHI.
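
A small sketch of aggregating on join with `by = .EACHI`, plus the `mult` argument of `[.data.table`, using made-up `orders` and `custs` tables:

```r
library(data.table)

orders <- data.table(cust = c("a", "a", "b", "c"),
                     amount = c(10, 20, 5, 7))
custs  <- data.table(cust = c("a", "b", "z"))

# by = .EACHI evaluates j once for each row of i (here: each customer),
# so the join and the aggregation happen in one step.
totals <- orders[custs, on = "cust", .(total = sum(amount)), by = .EACHI]

# mult picks which match to keep when a row of i matches several rows of x.
first_match <- orders[custs, on = "cust", mult = "first"]
last_match  <- orders[custs, on = "cust", mult = "last"]
```

Customer `z` has no orders, so its `total` and `amount` come back as `NA`.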

@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
rikivillalba added a commit to AngelFelizR/data.table that referenced this issue Sep 7, 2024