
fasttext ft_hash and unicode handling #2059

Closed · leezu opened this issue May 23, 2018 · 10 comments · Fixed by #2313
Labels: bug · difficulty medium · fasttext

Comments

@leezu

leezu commented May 23, 2018

FastText uses the hashing trick to map ngrams to an index in [0, N]. Gensim supports loading models trained with the original fastText implementation from Facebook Research. It is therefore important that both gensim and the original implementation use the same hash function, so that ngrams are associated with the correct vectors.
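
For context, a minimal sketch of how such an index is obtained and used (assuming ft_hash is the hash function shown below and bucket is fastText's -bucket parameter, 2,000,000 by default; the names here are illustrative):

row = ft_hash(ngram) % bucket       # pick one of `bucket` rows of the ngram matrix
ngram_vector = ngram_matrix[row]    # this row serves as the ngram's vector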

The original implementation is in C++ and treats an ngram as a std::basic_string<char>, i.e. a sequence of bytes:

uint32_t Dictionary::hash(const std::string& str) const {
  uint32_t h = 2166136261;
  for (size_t i = 0; i < str.size(); i++) {
    h = h ^ uint32_t(str[i]);
    h = h * 16777619;
  }
  return h;
}

However, the gensim Python implementation considers an ngram as a sequence of unicode characters:

        h = np.uint32(2166136261)
        for c in string:
            h = h ^ np.uint32(ord(c))
            h = h * np.uint32(16777619)
        return h

The two are not equivalent. Consider:

In [1]: import numpy as np

In [2]: def hash_unicode(s):
   ...:     h = np.uint32(2166136261)
   ...:     for c in s:
   ...:         h = h ^ np.uint32(ord(c))
   ...:         h = h * np.uint32(16777619)
   ...:     return h
   ...:
   ...:
   ...: def hash_bytes(s):
   ...:     h = np.uint32(2166136261)
   ...:     s = s.encode('utf-8')
   ...:     for c in s:
   ...:         h = h ^ np.uint32(c)
   ...:         h = h * np.uint32(16777619)
   ...:     return h
   ...:
   ...:

In [3]: hash_unicode("é")
Out[3]: 1812687940

In [4]: hash_bytes("é")
Out[4]: 513665217

I am not really familiar with gensim's code, so I may have overlooked something.

I assume that the Cython implementation also treats iteration over a unicode string as iteration over unicode characters, so the same consideration as above would apply, but I haven't verified this.

Edit: I checked the inputs to the hash function of the original fastText implementation. There, e.g. <α> is represented by 3c ffffffce ffffffb1 3e
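
Those values are the UTF-8 bytes of the ngram; the ffffff prefixes are just the 0xce / 0xb1 bytes sign-extended when the (signed) char is printed as a wider integer. A quick illustration (Python 3, added here for clarity):

>>> [hex(b) for b in u'<α>'.encode('utf-8')]
['0x3c', '0xce', '0xb1', '0x3e']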

@menshikh-iv
Contributor

Thanks for the report @leezu, looks like a bug to me.

@menshikh-iv added the bug and difficulty medium labels on Jul 30, 2018
@piskvorky
Owner

If this is the case, how can Fasttext models loaded in Gensim even work? Wouldn't the similarity results be basically random for OOV words?

This sounds like a critical issue to me. @jayantj @manneshiva thoughts?

@menshikh-iv
Contributor

@piskvorky good question, probably nobody tested it with non-English models.

Anyway, we need to check it.

@leezu
Author

leezu commented Aug 9, 2018

For your reference, this was also not implemented correctly in the fastText C++ implementation. It is fixed there now. More info facebookresearch/fastText#553

@menshikh-iv
Contributor

menshikh-iv commented Oct 6, 2018

Hi @leezu, I see the fix in FB repo: facebookresearch/fastText@9c9b9b2

I don't quite catch what we should do (the input can be non-UTF-8 too), can you give me some advice?

UPD: A std::string is an instantiation of the std::basic_string template with char, so in this case

std::string t = "привет";
std::cout << t.size();   // 12 (not 6)
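
The same observation in Python, as an illustration (the C++ hash iterates over the 12 bytes, while the current gensim hash iterates over the 6 characters):

>>> len(u"привет")
6
>>> len(u"привет".encode('utf-8'))
12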

You are right, our implementation is not compatible with FB's.

@leezu
Author

leezu commented Oct 7, 2018

@menshikh-iv, essentially it is necessary to make sure something along the lines of the following Python code is used (equivalent to the C++ code you linked in the previous post). I haven't looked into the gensim Cython implementation, but at least the pure Python version is incorrect (as it wrongly considers the characters rather than the bytes composing an ngram).


import sys
import numpy as np

def _byte_to_int(b):
    # Python 3: iterating over bytes already yields ints.
    return b

if sys.version_info[0] == 2:
    def _byte_to_int(b):
        # Python 2: iterating over bytes yields 1-char strings.
        return ord(b)

# Hash the UTF-8 bytes of the wrapped word, not its unicode characters.
word_enc = bytearray((u'<' + word + u'>').encode('utf-8'))
hashes = []
for ngram in ngrams(memoryview(word_enc)):  # byte-level ngrams, see the sketch below
    h = np.uint32(2166136261)
    for c in ngram:
        # The int8 cast mirrors the int8_t cast in the fixed C++ implementation.
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    hashes.append(h)
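
The ngrams helper above is left undefined; a rough Python 3 sketch of what it could look like, mirroring how the fastText C++ code walks UTF-8 byte sequences (minn=3 / maxn=6 are fastText's defaults, everything else here is illustrative, not gensim or fastText code):

def ngrams(word_bytes, minn=3, maxn=6):
    # Yield byte-level ngrams of minn..maxn *characters*. Character
    # boundaries are recovered from the UTF-8 encoding: a continuation
    # byte has the bit pattern 10xxxxxx, i.e. (b & 0xC0) == 0x80.
    data = bytes(word_bytes)
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    starts.append(len(data))
    num_chars = len(starts) - 1
    for i in range(num_chars):
        for n in range(minn, maxn + 1):
            if i + n > num_chars:
                break
            yield data[starts[i]:starts[i + n]]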

@aneesh-joshi
Contributor

aneesh-joshi commented Oct 15, 2018

@leezu
I am trying to fix these issues.
What would you say is a good test condition to ensure the gensim implementation is compatible with Facebook's?

Does:
fb_hash("é") == gensim_hash("é")
cover all the cases?

Also:
I am not used to handling unicode strings at an advanced level, so I need some help in understanding.
Please tell me if I understand correctly:

The current gensim hash function hashes based on ord, i.e. the unicode code point of each character in the string,
whereas Facebook hashes the underlying bytes.

This is leading to different hash values for the same strings.

Ideally, shouldn't changing the gensim hash function to use unicode work?
Maybe not, because Python 2 and Python 3 iterate over bytes differently, so we have to convert the Python 2 bytes to ints manually, as you have done in the _byte_to_int function.

Do I understand correctly?

For now, I will try to incorporate your "pseudo" code into gensim.

@leezu
Author

leezu commented Oct 16, 2018

@aneesh-joshi I suggest you train a fastText model using the C++ implementation on a small test corpus containing unicode symbols (i.e. a sentence or a few sentences). Then use the fastText C++ code again to print out the vectors for all words and store them. The test case should be: given the binary model, match the vectors computed by the C++ implementation exactly.
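
A rough Python 3 sketch of such a test (my_model.bin / my_model.vec are placeholder names here; the .vec text file keeps only a few decimals, so compare with a small tolerance rather than bit-for-bit):

import numpy as np
from gensim.models import FastText

# Vectors printed by the C++ fastText binary: word2vec text format,
# i.e. a "count dim" header line, then one word and its floats per line.
expected = {}
with open('my_model.vec', encoding='utf-8') as f:
    next(f)  # skip the header line
    for line in f:
        parts = line.rstrip().split(' ')
        expected[parts[0]] = np.array(parts[1:], dtype=np.float32)

model = FastText.load_fasttext_format('my_model.bin')
for word, vec in expected.items():
    assert np.allclose(model.wv[word], vec, atol=1e-5), word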

The hash function must operate on the bytes of the UTF-8 encoded text; one unicode character may consist of multiple bytes. This is trivial in C++ but a bit involved in Python, due to the differences between Py2 and Py3 and the fact that Python in general abstracts the bytes away and simply handles characters.

You can also check the GluonNLP implementation: https://github.com/dmlc/gluon-nlp/blob/f9be6cd2c3780b3c7e11a1aca189bf8129bc0c0d/gluonnlp/vocab/subwords.py#L171-L275
The pseudo-code above is essentially a simplified version of the implementation in GluonNLP.

@aneesh-joshi
Contributor

aneesh-joshi commented Oct 23, 2018

@leezu @menshikh-iv @piskvorky
Update: I was able to reproduce the error and fix it.

To reproduce:

I trained C++ Fasttext on a simple .txt file which had unicode and non-unicode characters. This is the file:

This file contains some unicode sentences like привет as provided by Ivan Menshikh. I will also include some others like : é é and ααααααα
αα α ααα α
Let's hope this works.

I ran:

./fasttext skipgram -minCount 0 -input my_test_file.txt -output my_model
# This created a `my_model.bin` and `my_model.vec` file
cat my_model.vec
.
.
.
ααααααα -0.0030363 -0.0027762 . . . -0.0018107 0.001506
.
.

Then:

from gensim.models import FastText
model = FastText.load_fasttext_format("my_model.bin")
print(model.wv["ααααααα"])
array([ 1.9879919e-03,  1.7319367e-03,  6.5513700e-04,  1.2212876e-03,
         .
         .
       -2.0760519e-03,  9.5942023e-04,  1.8221042e-04,  2.6060001e-04],
      dtype=float32)

This is clearly different from the original FT vector (for non-unicode strings, on the other hand, the two implementations do give the same result).

To fix:

I checked out my branch with @leezu's suggested changes and ran the same commands as above.

array([-3.0362648e-03, -2.7762062e-03, -1.8186163e-03,  8.1895100e-04,
       .
       .
        9.9862483e-04, -1.6497656e-03, -1.8107096e-03,  1.5059930e-03],
      dtype=float32)

This gives the same result.

Current problem:

My test case is hanging/going into an infinite loop in some case; I will investigate more.
I will also check if this works for py2.

@menshikh-iv
Contributor

menshikh-iv commented Oct 23, 2018

Nice progress @aneesh-joshi 👍

Please test it as fully as possible:

  • existing word-vector
  • only non-unicode
  • only unicode
  • mix up unicode with non-unicode

Also, I'm worried that we possibly generate ngrams incorrectly (for the same reason), please check this too.
And one last thing: have a look at how this affects performance (compared to the current, wrong variant, just to know); maybe it's better to encode the full string once instead of each ngram independently.

keep me updated!
