Skip to content

This competition involves a unique dataset from the Q&A videos of DataTalks.Club courses, challenging participants to accurately match questions with their correct answers

Notifications You must be signed in to change notification settings

gowtham07/Question-Answer-Kaggle-Competition

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 

Repository files navigation

Question-Answer-Kaggle-Competition

This competition involves a unique dataset from the Q&A videos of DataTalks.Club courses, challenging participants to accurately match questions with their correct answers

Kaggle Notebook

In the above notebook I have made use of Pandas to clean the data in CSV file, analyze and save data to dataframes. I made use of sentence transformer for this purpose. Firstly, I tried to use triplet loss for this usecase ,most popular loss functions for supervised similarity or metric learning ever since. In its simplest explanation, Triplet Loss encourages that dissimilar pairs be distant from any similar pairs by at least a certain margin value. The score for this was 0.69, where the triplet pair was (ques, ans, other probable answer but not 100% related). Triplet loss tries to align the embeddings of ques and ans near and pushes away the other probable answer.

The next step I tried was to add multiple other probable answers so for example each question has three samples with the three different other probable answers for same ques and ans. This did not learn well and its not optimal as the score was 0.5648 which is way less than using only one sample per question, answer pair.

Next, I want to make it simple by using MultipleNegativesRankingLoss which takes [(a1, b1), …, (an, bn)] where we assume that (ai, bi) are similar sentences and (ai, bj) are dissimilar sentences for i != j. The minimizes the distance between (ai, bi) while it simultaneously maximizes the distance (ai, bj) for all i != j. This is better as we do not need to specify the negative and it automatically maximizes distance where i!=j. I got an improved performance of 0.962. Still this can be improved further by adding triplets instead of pairs and the third one should be hard-negatives where on lexical level its similar to the (a1, b1) but on semantic level its not similar to (a1, b1)

KaggleNotebook for triplet loss with one sample each for question

KaggleNotebook for triplet loss with multiple samples each for question

About

This competition involves a unique dataset from the Q&A videos of DataTalks.Club courses, challenging participants to accurately match questions with their correct answers

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published