fasttext ft_hash and unicode handling #2059
Comments
Thanks for the report @leezu, this looks like a bug to me.
If this is the case, how can fastText models loaded in Gensim even work? Wouldn't the similarity results be basically random for OOV words? This sounds like a critical issue to me. @jayantj @manneshiva thoughts?
@piskvorky good question; probably nobody has tested it with non-English models. Either way, we need to check it.
For your reference, this was also not implemented correctly in the fastText C++ implementation. It is fixed there now. More info: facebookresearch/fastText#553
Hi @leezu, I see the fix in the FB repo: facebookresearch/fastText@9c9b9b2. I don't quite follow what we should do (the input can be non-UTF-8 too); can you give me some advice?
UPD:
std::string t = "привет";
std::cout << t.size(); // 12 (not 6)
You are right, our implementation is not compatible with FB's.
@menshikh-iv, essentially it is necessary to make sure something along the lines of the following Python code is used (equivalent to the C++ code you linked in the previous post). I haven't looked into the gensim Cython implementation, but at least the pure Python version is incorrect (it hashes the characters of an ngram rather than the bytes that make it up).
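A minimal sketch of what a byte-based hash could look like, assuming fastText's hash is 32-bit FNV-1a applied to the UTF-8 bytes of the ngram, with each byte sign-extended the way a C++ cast through int8_t would do. The names `ft_hash_bytes` and `ft_hash_chars` are illustrative, not gensim's actual functions; `ft_hash_chars` only mimics the character-based behaviour described in this thread.

```python
FNV_OFFSET = 2166136261  # 32-bit FNV offset basis
FNV_PRIME = 16777619     # 32-bit FNV prime
MASK = 0xFFFFFFFF        # keep arithmetic in the uint32 range


def ft_hash_bytes(ngram: str) -> int:
    """Hash the UTF-8 bytes of `ngram` (assumed to match the C++ behaviour)."""
    h = FNV_OFFSET
    for b in ngram.encode("utf-8"):
        if b >= 0x80:
            b |= 0xFFFFFF00  # emulate uint32_t(int8_t(b)) sign extension
        h = (h ^ b) & MASK
        h = (h * FNV_PRIME) & MASK
    return h


def ft_hash_chars(ngram: str) -> int:
    """Character-based variant, shown only to illustrate the mismatch."""
    h = FNV_OFFSET
    for ch in ngram:
        h = (h ^ ord(ch)) & MASK
        h = (h * FNV_PRIME) & MASK
    return h


if __name__ == "__main__":
    for ngram in ("<abc>", "<α>"):
        print(ngram, ft_hash_bytes(ngram) == ft_hash_chars(ngram))
```

For pure ASCII input the two variants agree, which would explain why the problem goes unnoticed with English-only models; they diverge as soon as an ngram contains a multi-byte character.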
@leezu Does: Also: The current gensim hash function hashes based on the This is leading to different hash values for the same strings. Ideally, shouldn't changing the gensim hash function to use unicode work? Do I understand correctly? For now, I will try to incorporate your "pseudo" code into gensim.
@aneesh-joshi I suggest you train a fastText model with the C++ implementation on a small test corpus containing unicode symbols (i.e. a sentence or a few sentences). Then use the fastText C++ code again to print out the vectors for all words, and store these vectors. The test case should be: given the binary model, match the vectors computed by the C++ implementation exactly. The hash function must operate on the bytes of the unicode text. One unicode character may consist of multiple bytes. This is trivial in C++ but a bit involved in Python, both because of the differences between Py2 and Py3 and because Python in general abstracts the bytes away and simply handles characters. You can also check the GluonNLP implementation: https:/dmlc/gluon-nlp/blob/f9be6cd2c3780b3c7e11a1aca189bf8129bc0c0d/gluonnlp/vocab/subwords.py#L171-L275
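To make the bytes-versus-characters point concrete, here is a small Python 3 sketch (Python 2 string semantics differ and are not covered). The helper name `utf8_len` is hypothetical.

```python
def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of `text`."""
    return len(text.encode("utf-8"))


word = "привет"
print(len(word))       # 6: Python counts characters (code points)
print(utf8_len(word))  # 12: each Cyrillic letter takes two bytes in UTF-8
```

A hash that is meant to match the C++ implementation would need to walk over those 12 bytes, not the 6 characters.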
@leezu @menshikh-iv @piskvorky
To reproduce:
I trained C++ fastText on a simple .txt file which had unicode and non-unicode characters. This is the file:
I ran:
Then:
This is clearly different from the original FT vector. It gives the same result for non-unicode strings.
To fix:
I checked out my branch with @leezu's suggested changes.
This gives the same result.
Current problem:
My test case hangs or goes into an infinite loop in some case. I will investigate more.
Nice progress @aneesh-joshi 👍 Please test it as fully as possible.
Also, I'm worried that we may be generating ngrams incorrectly (for the same reason); please check this too. Keep me updated!
fastText uses the hashing trick to map ngrams to an index in [0, N). Gensim supports loading models trained with the original fastText implementation from Facebook Research. It is therefore important that both gensim and the original implementation use the same hash function, so that ngrams are associated with the correct vectors.
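The mapping described above can be sketched as follows. This assumes the ngram hash is reduced modulo the number of buckets and then offset by the vocabulary size, so that ngram rows follow the word rows in the input matrix; the function and parameter names are hypothetical.

```python
def ngram_row(ngram_hash: int, num_buckets: int, num_words: int = 0) -> int:
    """Map a 32-bit ngram hash to a row index in the embedding matrix.

    Assumption: ngram rows are stored after the `num_words` word rows.
    """
    return num_words + ngram_hash % num_buckets


# Two implementations that hash the same ngram differently will, in general,
# look up different rows and therefore different vectors.
print(ngram_row(10, num_buckets=4))                 # 2
print(ngram_row(10, num_buckets=4, num_words=100))  # 102
```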
The original implementation is in C++ and treats ngrams as std::basic_string<char>, i.e. a sequence of bytes:
However, the gensim Python implementation treats an ngram as a sequence of unicode characters:
The two are not equivalent. Consider:
I am not really familiar with gensim's code, so I may have overlooked something.
I assume that the Cython implementation also treats iteration over a unicode string as iteration over unicode characters, so the same consideration as above would apply, but I haven't verified this.
Edit: I checked the inputs to the hash function of the original fastText implementation. There, e.g. <α> is represented by 3c ffffffce ffffffb1 3e
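The byte values quoted in the edit can be reproduced with a short sketch, assuming the ngram is UTF-8 encoded and each byte is shown after sign extension to 32 bits (which is what printing a signed char as hex typically produces in C++). The function name is illustrative.

```python
def signed_byte_dump(ngram: str) -> str:
    """Render the UTF-8 bytes of `ngram` as sign-extended 32-bit hex values."""
    parts = []
    for b in ngram.encode("utf-8"):
        if b >= 0x80:
            b |= 0xFFFFFF00  # bytes >= 0x80 are negative as signed chars
        parts.append(format(b, "x"))
    return " ".join(parts)


print(signed_byte_dump("<α>"))  # 3c ffffffce ffffffb1 3e
```

The single character α (U+03B1) contributes two bytes, 0xCE and 0xB1, so the hash sees four inputs rather than three.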