
fasttext ft_hash and unicode handling #2059

Closed · leezu opened this issue May 23, 2018 · 10 comments · Fixed by #2313
Labels: bug · difficulty medium · fasttext

Comments

@leezu

leezu commented May 23, 2018

FastText uses the hashing trick to map ngrams to an index in [0, N]. Gensim supports loading models trained with the original fastText implementation from Facebook Research. It is therefore important that both gensim and the original implementation use the same hash function, so that ngrams are associated with the correct vectors.
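
For context, a minimal sketch of how such an index is obtained and used (assuming ft_hash is the hash function shown below and bucket is fastText's -bucket parameter, 2,000,000 by default; the names here are illustrative):

row = ft_hash(ngram) % bucket       # pick one of `bucket` rows of the ngram matrix
ngram_vector = ngram_matrix[row]    # this row serves as the ngram's vector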

The original implementation is in C++ and treats an ngram as a std::basic_string<char>, i.e. a sequence of bytes:

uint32_t Dictionary::hash(const std::string& str) const {
  uint32_t h = 2166136261;
  for (size_t i = 0; i < str.size(); i++) {
    h = h ^ uint32_t(str[i]);
    h = h * 16777619;
  }
  return h;
}

However, the gensim Python implementation considers an ngram as a sequence of unicode characters:

        h = np.uint32(2166136261)
        for c in string:
            h = h ^ np.uint32(ord(c))
            h = h * np.uint32(16777619)
        return h

The two are not equivalent. Consider:

In [1]: import numpy as np

In [2]: def hash_unicode(s):
   ...:     h = np.uint32(2166136261)
   ...:     for c in s:
   ...:         h = h ^ np.uint32(ord(c))
   ...:         h = h * np.uint32(16777619)
   ...:     return h
   ...:
   ...:
   ...: def hash_bytes(s):
   ...:     h = np.uint32(2166136261)
   ...:     s = s.encode('utf-8')
   ...:     for c in s:
   ...:         h = h ^ np.uint32(c)
   ...:         h = h * np.uint32(16777619)
   ...:     return h
   ...:
   ...:

In [3]: hash_unicode("é")
Out[3]: 1812687940

In [4]: hash_bytes("é")
Out[4]: 513665217

I am not really familiar with gensim's code, so I may have overlooked something.

I assume that the Cython implementation also treats iteration over a unicode string as iteration over unicode characters, so the same consideration as above would apply, but I haven't verified this.

Edit: I checked the inputs to the hash function of the original fastText implementation. There, e.g. <α> is represented by 3c ffffffce ffffffb1 3e
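
Those values are the UTF-8 bytes of the ngram; the ffffff prefixes are just the 0xce / 0xb1 bytes sign-extended when the (signed) char is printed as a wider integer. A quick illustration (Python 3, added here for clarity):

>>> [hex(b) for b in u'<α>'.encode('utf-8')]
['0x3c', '0xce', '0xb1', '0x3e']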

@menshikh-iv
Contributor

Thanks for the report @leezu, looks like a bug to me.

@menshikh-iv added the bug and difficulty medium labels on Jul 30, 2018
@piskvorky
Owner

If this is the case, how can Fasttext models loaded in Gensim even work? Wouldn't the similarity results be basically random for OOV words?

This sounds like a critical issue to me. @jayantj @manneshiva thoughts?

@menshikh-iv
Contributor

@piskvorky good question, probably nobody tested it with non-English models.

Anyway, we need to check it.

@leezu
Author

leezu commented Aug 9, 2018

For your reference, this was also not implemented correctly in the fastText C++ implementation. It is fixed there now. More info facebookresearch/fastText#553

@menshikh-iv
Contributor

menshikh-iv commented Oct 6, 2018

Hi @leezu, I see the fix in FB repo: facebookresearch/fastText@9c9b9b2

I don't quite catch what we should do (the input can be non-UTF-8 too), can you give me some advice?

UPD: A std::string is an instantiation of the std::basic_string template with char, so in this case

std::string t = "привет";
std::cout << t.size();   // 12 (not 6)
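
The same observation in Python, as an illustration (the C++ hash iterates over the 12 bytes, while the current gensim hash iterates over the 6 characters):

>>> len(u"привет")
6
>>> len(u"привет".encode('utf-8'))
12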

You are right, our implementation is not compatible with FB's.

@leezu
Author

leezu commented Oct 7, 2018

@menshikh-iv, essentially it is necessary to make sure something along the lines of the following Python code is used (equivalent to the C++ code you linked in the previous post). I haven't looked into the gensim Cython implementation, but at least the pure Python version is incorrect (as it wrongly considers the characters rather than the bytes composing an ngram).


import sys
import numpy as np

def _byte_to_int(b):
    # Python 3: iterating over bytes already yields ints.
    return b

if sys.version_info[0] == 2:
    def _byte_to_int(b):
        # Python 2: iterating over bytes yields 1-char strings.
        return ord(b)

# Hash the UTF-8 bytes of the wrapped word, not its unicode characters.
word_enc = bytearray((u'<' + word + u'>').encode('utf-8'))
hashes = []
for ngram in ngrams(memoryview(word_enc)):  # byte-level ngrams, see the sketch below
    h = np.uint32(2166136261)
    for c in ngram:
        # The int8 cast mirrors the int8_t cast in the fixed C++ implementation.
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    hashes.append(h)
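
The ngrams helper above is left undefined; a rough Python 3 sketch of what it could look like, mirroring how the fastText C++ code walks UTF-8 byte sequences (minn=3 / maxn=6 are fastText's defaults, everything else here is illustrative, not gensim or fastText code):

def ngrams(word_bytes, minn=3, maxn=6):
    # Yield byte-level ngrams of minn..maxn *characters*. Character
    # boundaries are recovered from the UTF-8 encoding: a continuation
    # byte has the bit pattern 10xxxxxx, i.e. (b & 0xC0) == 0x80.
    data = bytes(word_bytes)
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    starts.append(len(data))
    num_chars = len(starts) - 1
    for i in range(num_chars):
        for n in range(minn, maxn + 1):
            if i + n > num_chars:
                break
            yield data[starts[i]:starts[i + n]]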

@aneesh-joshi
Contributor

aneesh-joshi commented Oct 15, 2018

@leezu
I am trying to fix these issues.
What would you say is a good test condition to ensure the gensim implementation is compatible with Facebook's?

Does:
fb_hash("é") == gensim_hash("é")
cover all the cases?

Also:
I am not used to handling unicode strings at an advanced level, so I need some help in understanding.
Please tell me if I understand correctly:

The current gensim hash function hashes based on ord, i.e. the unicode code point of each character in the string,
whereas Facebook hashes the underlying bytes.

This is leading to different hash values for the same strings.

Ideally, shouldn't changing the gensim hash function to use unicode work?
Maybe not, because Python 2 and Python 3 iterate over bytes differently, so we have to convert the Python 2 bytes to ints manually, as you have done in the _byte_to_int function.

Do I understand correctly?

For now, I will try to incorporate your "pseudo" code into gensim.

@leezu
Author

leezu commented Oct 16, 2018

@aneesh-joshi I suggest you train a fastText model using the C++ implementation on a small test corpus containing unicode symbols (i.e. a sentence or a few sentences). Then use the fastText C++ code again to print out the vectors for all words and store them. The test case should be: given the binary model, match the vectors computed by the C++ implementation exactly.
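
A rough Python 3 sketch of such a test (my_model.bin / my_model.vec are placeholder names here; the .vec text file keeps only a few decimals, so compare with a small tolerance rather than bit-for-bit):

import numpy as np
from gensim.models import FastText

# Vectors printed by the C++ fastText binary: word2vec text format,
# i.e. a "count dim" header line, then one word and its floats per line.
expected = {}
with open('my_model.vec', encoding='utf-8') as f:
    next(f)  # skip the header line
    for line in f:
        parts = line.rstrip().split(' ')
        expected[parts[0]] = np.array(parts[1:], dtype=np.float32)

model = FastText.load_fasttext_format('my_model.bin')
for word, vec in expected.items():
    assert np.allclose(model.wv[word], vec, atol=1e-5), word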

The hash function must operate on the bytes of the UTF-8 encoded text; one unicode character may consist of multiple bytes. This is trivial in C++ but a bit involved in Python, due to the differences between Py2 and Py3 and the fact that Python in general abstracts the bytes away and simply handles characters.

You can also check the GluonNLP implementation: https://github.com/dmlc/gluon-nlp/blob/f9be6cd2c3780b3c7e11a1aca189bf8129bc0c0d/gluonnlp/vocab/subwords.py#L171-L275
The pseudo-code above is essentially a simplified version of the implementation in GluonNLP.

@aneesh-joshi
Contributor

aneesh-joshi commented Oct 23, 2018

@leezu @menshikh-iv @piskvorky
Update: I was able to reproduce the error and fix it.

To reproduce:

I trained C++ Fasttext on a simple .txt file which had unicode and non-unicode characters. This is the file:

This file contains some unicode sentences like привет as provided by Ivan Menshikh. I will also include some others like : é é and ααααααα
αα α ααα α
Let's hope this works.

I ran:

./fasttext skipgram -minCount 0 -input my_test_file.txt -output my_model
# This created a `my_model.bin` and `my_model.vec` file
cat my_model.vec
.
.
.
ααααααα -0.0030363 -0.0027762 . . . -0.0018107 0.001506
.
.

Then:

from gensim.models import FastText
model = FastText.load_fasttext_format("my_model.bin")
print(model.wv["ααααααα"])
array([ 1.9879919e-03,  1.7319367e-03,  6.5513700e-04,  1.2212876e-03,
         .
         .
       -2.0760519e-03,  9.5942023e-04,  1.8221042e-04,  2.6060001e-04],
      dtype=float32)

This is clearly different from the original FT vector (for non-unicode strings, on the other hand, the two implementations do give the same result).

To fix:

I checked out my branch with @leezu's suggested changes and ran the same commands as above.

array([-3.0362648e-03, -2.7762062e-03, -1.8186163e-03,  8.1895100e-04,
       .
       .
        9.9862483e-04, -1.6497656e-03, -1.8107096e-03,  1.5059930e-03],
      dtype=float32)

This gives the same result.

Current problem:

My test case is hanging/going into an infinite loop in some case; I will investigate more.
I will also check if this works for py2.

@menshikh-iv
Contributor

menshikh-iv commented Oct 23, 2018

Nice progress @aneesh-joshi 👍

Please test it as fully as possible:

  • existing word-vector
  • only non-unicode
  • only unicode
  • mix up unicode with non-unicode

Also, I'm worried that we possibly generate ngrams incorrectly (for the same reason), please check this too.
And one last thing: have a look at how this affects performance (compared to the current, wrong variant, just to know); maybe it's better to encode the full string once instead of each ngram independently.

keep me updated!
