Different merge types result in the same logic #14

Closed
seanoconnor23 opened this issue Feb 21, 2024 · 4 comments

@seanoconnor23

Thanks for such a great repository! I was just going through the code where you run merge() on two data frames, and it looks as if the logic is the same for all three merge types (1:m, m:1, or 1:1):

    ## Add to index depending on merge type
    if merge_type == "1:m":
        index.add(embeddings2)
    elif merge_type == "m:1":
        index.add(embeddings2)
    elif merge_type == "1:1":
        index.add(embeddings2)

    ## Search index
    if merge_type == "1:m":
        D, I = index.search(embeddings1, 1)
    elif merge_type == "m:1":
        D, I = index.search(embeddings1, 1)
    elif merge_type == "1:1":
        D, I = index.search(embeddings1, 1)

Could someone explain why this is the case, please? Thanks! 😄

@econabhishek
Collaborator

Thanks! We are thinking about deprecating this. It had different logic before, but we reverted to a single code path because a fuzzy merge is slightly different from a standard merge.
For now, we treat the "right" columns as the "corpus" in retrieval terminology and the "left" columns as the "queries".
We believe that for most use cases of this tool, 1:1 is highly unlikely to ever be used because of the very fuzzy nature of the task. To keep the API simple, we dropped the separate logic for 1:m and m:1, which are standard in exact matching in statistical packages. In most at-scale tasks, we have no idea whether there are any fuzzy duplicates in either the left or right columns, so even m:m is supported (though anyone who does record linkage would recommend against it).
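To make that framing concrete, here is a minimal sketch of the "right columns as corpus, left columns as queries" setup, using sentence-transformers and FAISS directly. This is illustrative only, not the package's actual implementation; the model name and column names are assumptions.

    import faiss
    import numpy as np
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    df1 = pd.DataFrame({"name": ["Acme Corp", "Globex"]})             # left = queries
    df2 = pd.DataFrame({"name": ["ACME Corporation", "Globex LLC"]})  # right = corpus

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb2 = model.encode(df2["name"].tolist(), normalize_embeddings=True)  # corpus embeddings
    emb1 = model.encode(df1["name"].tolist(), normalize_embeddings=True)  # query embeddings

    index = faiss.IndexFlatIP(int(emb2.shape[1]))   # inner product = cosine after normalization
    index.add(np.asarray(emb2, dtype="float32"))

    D, I = index.search(np.asarray(emb1, dtype="float32"), 1)  # 1 nearest neighbour per left row

    # One match per left row, so the output always has len(df1) rows,
    # regardless of which merge_type string was passed.
    merged = df1.assign(match=df2["name"].iloc[I[:, 0]].values, score=D[:, 0])
    print(merged)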

However, we are open to suggestions, and if you have a case for keeping these and adjusting the logic appropriately, we are happy to revisit this. 1:m and m:1 would just be symmetric and 1:1 is a special case of 1:m.

Also, fun side note: pandas' standard merge has an option called validate where these relationships can be specified. It only checks for duplication of keys and has no bearing on the match itself; we could probably rename our option to validate as well.
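For reference, this is the pandas behaviour being referred to: validate only checks key uniqueness and raises if the stated relationship does not hold; it does not change how rows are matched.

    import pandas as pd

    left = pd.DataFrame({"id": [1, 2], "x": ["a", "b"]})
    right = pd.DataFrame({"id": [1, 1, 2], "y": ["p", "q", "r"]})

    # Passes: every left key is unique, right keys may repeat.
    pd.merge(left, right, on="id", validate="1:m")

    # Would raise pandas.errors.MergeError: right has duplicate keys, so "1:1" does not hold.
    # pd.merge(left, right, on="id", validate="1:1")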

@seanoconnor23
Author

Interesting, thanks for the explanation and for getting back to me so quickly! I'll describe the type of datasets I'm using and why 1:m would be really helpful for me.

In dataset one I have about 30,000 users each with a unique id assigned to them.
In dataset two I have about 50,000 users each with a unique id assigned to them.

The problem I'm running into is that I need to map one id from dataset one to multiple ids in dataset two, since in dataset two the company can change an individual's id more than once. The reason I need 1:m is that I want to map one id to multiple ids so I can create a universal id for that person. However, the results you get back from merge() only have as many rows as dataset one (len(dataset_one) = 30,000, i.e. effectively 1:1), so it's dropping potential mappings for me.

I know I could switch the datasets around but I need to keep the 30,000 dataset as df1 and the 50,000 dataset as df2.

If you could provide the snippet I need for a 1:m mapping, that would be really appreciated 🙏

Thanks once again!

@econabhishek
Collaborator

Thanks! Just so I understand this clearly, what is preventing you from switching the two datasets around?
What the code is doing in words is "For each row in df1 (left columns) find a nearest neighbour from df2 (right columns)".
Hence, you only see len(df1) rows in the output. You have a couple of options within the current framework; tell me if these sound neat enough to you.

  1. Use merge_knn() with a generous k (say 10); this will return the k nearest neighbours of the left rows from the right df. Then choose a score threshold that makes sense (e.g. scores above 0.8 are reasonable matches) and filter the matches. This should be very close to what you need. k can be very large, but we recommend keeping it as low as possible for speed; Meta's FAISS library (which forms our retrieval backbone) recommends keeping it below 900. An example is in this notebook, and a rough sketch follows at the end of this comment.

  2. Switch the dataframes around and use merge(). This time the result will have len(df2) = 50,000 rows. You can again check which score threshold makes sense (the score is symmetric since we are using cosine similarity) and drop matches below it. I am not sure why you don't want to switch these around.

If you think about this a bit more conceptually, when we want a 1:m fuzzy merge we are saying: "I want several suitable matches from df2 for each row in df1". That goal can be achieved by either 1 or 2; does this make sense?
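In case it helps, here is a rough sketch of option 1 (k nearest neighbours plus a score threshold). It is illustrative, not merge_knn()'s actual implementation; the data, model name, k, and threshold are assumptions.

    import faiss
    import numpy as np
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    df1 = pd.DataFrame({"user_id": [101, 102], "name": ["Jane A. Doe", "John Smith"]})
    df2 = pd.DataFrame({"user_id": [9001, 9002, 9003],
                        "name": ["Jane Doe", "J. Doe", "Jon Smith"]})

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb2 = model.encode(df2["name"].tolist(), normalize_embeddings=True)   # right df = corpus
    emb1 = model.encode(df1["name"].tolist(), normalize_embeddings=True)   # left df = queries

    index = faiss.IndexFlatIP(int(emb2.shape[1]))
    index.add(np.asarray(emb2, dtype="float32"))

    k, threshold = 10, 0.8
    k = min(k, len(df2))                        # k cannot exceed the corpus size
    D, I = index.search(np.asarray(emb1, dtype="float32"), k)

    # Keep every candidate above the threshold, so a single left id
    # can map to several right ids (the 1:m behaviour you are after).
    pairs = []
    for row, (scores, idxs) in enumerate(zip(D, I)):
        for score, j in zip(scores, idxs):
            if score >= threshold:
                pairs.append({"left_id": df1["user_id"].iloc[row],
                              "right_id": df2["user_id"].iloc[j],
                              "score": float(score)})
    matches = pd.DataFrame(pairs)
    print(matches)

Option 2 is the same idea with df1 and df2 swapped before the merge, followed by the same threshold filter.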

@econabhishek
Collaborator

Closing due to inactivity; solutions are provided above.
