Different merge types result in the same logic #14

Closed
seanoconnor23 opened this issue Feb 21, 2024 · 4 comments

@seanoconnor23

Thanks for such a great repository! I was just going through the code where you run merge() on two data frames, and it looks as if the logic is the same for all three merge types (1:m, m:1, or 1:1):

    ## Add to index depending on merge type
    if merge_type == "1:m":
        index.add(embeddings2)
    elif merge_type == "m:1":
        index.add(embeddings2)
    elif merge_type == "1:1":
        index.add(embeddings2)

    ## Search index
    if merge_type == "1:m":
        D, I = index.search(embeddings1, 1)
    elif merge_type == "m:1":
        D, I = index.search(embeddings1, 1)
    elif merge_type == "1:1":
        D, I = index.search(embeddings1, 1)

Could someone explain why this is the case, please? Thanks! 😄

@econabhishek
Collaborator

Thanks! We are thinking about deprecating this. It had different logic before, but we reverted to a single code path because a fuzzy merge is slightly different from a standard merge.
For now, we treat the "right" columns as the "corpus" in retrieval terminology and the "left" columns as the "queries".
We believe that for most use cases of this tool, 1:1 is highly unlikely to ever be used because of the very fuzzy nature of the task. To keep the API simple, we dropped the separate logic for 1:m and m:1, which are standard in exact matching in statistical packages. In most at-scale tasks, we have no idea whether there are any fuzzy duplicates in either the left or right columns, so even m:m is supported (though anyone who does record linkage would recommend against it).
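To make that framing concrete, here is a minimal sketch of the "right columns as corpus, left columns as queries" setup, using sentence-transformers and FAISS directly. This is illustrative only, not the package's actual implementation; the model name and column names are assumptions.

    import faiss
    import numpy as np
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    df1 = pd.DataFrame({"name": ["Acme Corp", "Globex"]})             # left = queries
    df2 = pd.DataFrame({"name": ["ACME Corporation", "Globex LLC"]})  # right = corpus

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb2 = model.encode(df2["name"].tolist(), normalize_embeddings=True)  # corpus embeddings
    emb1 = model.encode(df1["name"].tolist(), normalize_embeddings=True)  # query embeddings

    index = faiss.IndexFlatIP(int(emb2.shape[1]))   # inner product = cosine after normalization
    index.add(np.asarray(emb2, dtype="float32"))

    D, I = index.search(np.asarray(emb1, dtype="float32"), 1)  # 1 nearest neighbour per left row

    # One match per left row, so the output always has len(df1) rows,
    # regardless of which merge_type string was passed.
    merged = df1.assign(match=df2["name"].iloc[I[:, 0]].values, score=D[:, 0])
    print(merged)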

However, we are open to suggestions, and if you have a case for keeping these and adjusting the logic appropriately, we are happy to revisit this. 1:m and m:1 would just be symmetric and 1:1 is a special case of 1:m.

Also, fun side note: pandas' standard merge has an option called validate where these relationships can be specified. It only checks for duplication of keys and has no bearing on the match itself; we could probably rename our option to validate as well.
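For reference, this is the pandas behaviour being referred to: validate only checks key uniqueness and raises if the stated relationship does not hold; it does not change how rows are matched.

    import pandas as pd

    left = pd.DataFrame({"id": [1, 2], "x": ["a", "b"]})
    right = pd.DataFrame({"id": [1, 1, 2], "y": ["p", "q", "r"]})

    # Passes: every left key is unique, right keys may repeat.
    pd.merge(left, right, on="id", validate="1:m")

    # Would raise pandas.errors.MergeError: right has duplicate keys, so "1:1" does not hold.
    # pd.merge(left, right, on="id", validate="1:1")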

@seanoconnor23
Author

Interesting, thanks for the explanation and for getting back to me so quickly! I'll describe the type of datasets I'm using and why 1:m would be really helpful for me.

In dataset one I have about 30,000 users each with a unique id assigned to them.
In dataset two I have about 50,000 users each with a unique id assigned to them.

The problem I'm running into is that I need to map one id from dataset one to multiple ids in dataset two, since in dataset two the company can change an individual's id more than once. The reason I need 1:m is that I want to map one id to multiple ids so I can create a universal id for that person. However, the results you get back from merge() only have as many rows as dataset one (len(dataset_one) = 30,000, i.e. effectively 1:1), so it's dropping potential mappings for me.

I know I could switch the datasets around but I need to keep the 30,000 dataset as df1 and the 50,000 dataset as df2.

If you could provide the snippet I need for a 1:m mapping, that would be really appreciated 🙏

Thanks once again!

@econabhishek
Collaborator

Thanks! Just so I understand this clearly, what is preventing you from switching the two datasets around?
What the code is doing in words is "For each row in df1 (left columns) find a nearest neighbour from df2 (right columns)".
Hence, you only see len(df1) rows in the output. You have a couple of options within the current framework; tell me if these sound neat enough to you.

  1. Use merge_knn() with a generous k (say 10); this will return the k nearest neighbours of the left rows from the right df. Then choose a score threshold that makes sense (e.g. scores above 0.8 are reasonable matches) and filter the matches. This should be very close to what you need. k can be very large, but we recommend keeping it as low as possible for speed; Meta's FAISS library (which forms our retrieval backbone) recommends keeping it below 900. An example is in this notebook, and a rough sketch follows at the end of this comment.

  2. Switch the dataframes around and use merge(). This time the result will have len(df2) = 50,000 rows. You can again check which score threshold makes sense (the score is symmetric since we are using cosine similarity) and drop matches below it. I am not sure why you don't want to switch these around.

If you think about this a bit more conceptually, when we want a 1:m fuzzy merge we are saying: "I want several suitable matches from df2 for each row in df1". That goal can be achieved by either 1 or 2; does this make sense?
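In case it helps, here is a rough sketch of option 1 (k nearest neighbours plus a score threshold). It is illustrative, not merge_knn()'s actual implementation; the data, model name, k, and threshold are assumptions.

    import faiss
    import numpy as np
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    df1 = pd.DataFrame({"user_id": [101, 102], "name": ["Jane A. Doe", "John Smith"]})
    df2 = pd.DataFrame({"user_id": [9001, 9002, 9003],
                        "name": ["Jane Doe", "J. Doe", "Jon Smith"]})

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb2 = model.encode(df2["name"].tolist(), normalize_embeddings=True)   # right df = corpus
    emb1 = model.encode(df1["name"].tolist(), normalize_embeddings=True)   # left df = queries

    index = faiss.IndexFlatIP(int(emb2.shape[1]))
    index.add(np.asarray(emb2, dtype="float32"))

    k, threshold = 10, 0.8
    k = min(k, len(df2))                        # k cannot exceed the corpus size
    D, I = index.search(np.asarray(emb1, dtype="float32"), k)

    # Keep every candidate above the threshold, so a single left id
    # can map to several right ids (the 1:m behaviour you are after).
    pairs = []
    for row, (scores, idxs) in enumerate(zip(D, I)):
        for score, j in zip(scores, idxs):
            if score >= threshold:
                pairs.append({"left_id": df1["user_id"].iloc[row],
                              "right_id": df2["user_id"].iloc[j],
                              "score": float(score)})
    matches = pd.DataFrame(pairs)
    print(matches)

Option 2 is the same idea with df1 and df2 swapped before the merge, followed by the same threshold filter.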

@econabhishek
Collaborator

Closing due to inactivity; solutions are provided above.
