
Fix critical issues in FastText #2313

Merged
merged 136 commits into from
Jan 11, 2019

Conversation

mpenkov
Collaborator

@mpenkov mpenkov commented Jan 2, 2019

This PR contains fixes for all the critical bugs in our FastText implementation.

In conclusion, this makes FastText in Gensim more reliable, and directly compatible with Facebook's FastText implementation for OOV words and model persistence.


We also identified divergent behavior with the Facebook implementation. This behavior is caused by an optimization that uses a smaller number of buckets than available. The manifestation: if we compare two models,

  1. one trained by Gensim from a text file, and
  2. one trained by Facebook's fastText from the same file (same parameters) and loaded via Gensim,

then model 1 will have fewer ngram vectors than model 2. As a consequence, vectors for OOV terms will differ between the two models. This behavior is captured in our unit tests as test_out_of_vocab_gensim.
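For context, both implementations map every character ngram to one of `bucket` hash slots; the divergence is only in how many rows of the bucket matrix each side actually allocates. A rough Python sketch of the hashing step, modelled on fastText's FNV-1a-style `Dictionary::hash` (the sign-extension detail below is my reading of the C++ source, not something stated in this thread):

```python
def ft_hash(ngram):
    """FNV-1a over the ngram's UTF-8 bytes, modelled on fastText's
    Dictionary::hash.  fastText reads each byte through a signed char,
    so bytes >= 0x80 are sign-extended before the XOR (an easy detail
    to get wrong when matching the reference implementation)."""
    h = 2166136261  # FNV-1a 32-bit offset basis
    for byte in ngram.encode("utf-8"):
        signed = byte - 256 if byte >= 128 else byte
        h = (h ^ signed) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF  # FNV-1a 32-bit prime
    return h

def bucket_index(ngram, num_buckets=100):
    # With -bucket 100 (as in the commands below), all ngrams share 100 slots.
    return ft_hash(ngram) % num_buckets
```

Since the bucket index is a pure function of the ngram string, two models trained with the same `bucket` value hash every OOV ngram to the same slot; the vectors only differ if one side never allocated (or never updated) that slot.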

$ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100
Read 0M words
Number of words:  22
Number of labels: 0
Progress: 100.0%  words/sec/thread: 209  lr: 0.000000  loss: 4.100698  eta: 0h0m
this will make it easier to debug manually

$ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100 -dim 5
Read 0M words
Number of words:  22
Number of labels: 0
Progress: 100.0%  words/sec/thread: 199  lr: 0.000000  loss: 0.000000  eta: 0h0m
it cannot pass by design: training is non-deterministic, so conditions
must be tightly controlled to guarantee reproducibility, and that is too
much effort for a unit test
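The commit message above points at a general truth: to make such a test pass, every source of randomness (RNG seeds, hash randomization, thread scheduling) would have to be pinned. A minimal, gensim-free sketch of the principle, using a hypothetical helper:

```python
import random

def reproducible_shuffle(items, seed=42):
    """Shuffle deterministically: a dedicated random.Random(seed) makes
    the result independent of global RNG state, PYTHONHASHSEED, or
    platform.  (Illustrative helper, not gensim code.)"""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled

# Same seed, same order, on every run and every machine.
first = reproducible_shuffle(range(10))
second = reproducible_shuffle(range(10))
```

For actual word2vec/FastText-style training, a single worker thread is also required, because multi-threaded SGD updates land in a non-deterministic order even with fixed seeds.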
@menshikh-iv menshikh-iv changed the title WIP: fix critical issues in our fasttext implementation [WIP] Fix critical issues in FastText Jan 10, 2019
@menshikh-iv
Contributor

Great work @mpenkov 🚀 What's still missing before the merge?

mpenkov and others added 6 commits January 10, 2019 16:29
This is an internal method masquerading as a public one.  There is no
reason for anyone to call it.  Removing it will have no effect on
pickling/unpickling, as methods do not get serialized.

Therefore, removing it is safe.
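To see why removing the method is safe for pickling, note that pickle serializes an instance's attributes plus a reference to its class, never the methods themselves. A minimal illustration (hypothetical class, not gensim code):

```python
import pickle

class Model:
    def __init__(self):
        self.weights = [0.1, 0.2]

    def _internal_helper(self):  # lives on the class, not in the pickle
        return sum(self.weights)

m = Model()
data = pickle.dumps(m)

# Only self.weights and a reference to the Model class are in the
# payload; deleting _internal_helper from the class definition would
# leave `data` byte-identical.
restored = pickle.loads(data)
```

This is why dropping the method has no effect on previously pickled models: unpickling looks the class up by name and restores the stored attributes onto a fresh instance.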
@piskvorky
Owner

piskvorky commented Jan 11, 2019

We also identified divergent behavior with the Facebook implementation. This behavior is caused by an optimization that uses a smaller number of buckets than available.

I'd prefer to have the same implementation as FastText. Reasons:

  1. More straightforward compatibility and fewer surprises, for both users and developers.
  2. If the buckets take up too much memory, the user can specify fewer buckets (that's up to the user; I see no reason to optimize this on our side).
  3. It is better for quality to have the (random) vectors for different OOV ngrams contribute differently (as in FastText), rather than be skipped and contribute nothing at all (as in Gensim now). As an optimization, we could probably generate the random vectors on the fly (deterministically), but I don't think that's needed, critical, or urgent at this point. It would also complicate incremental training, where an "OOV" bucket becomes "IV" with new data.
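The idea in point 3 — generating OOV ngram vectors on the fly, deterministically — can be sketched by seeding an RNG with a stable hash of the ngram, so the same ngram always maps to the same vector with no per-ngram storage. This is illustrative only, not what gensim implements:

```python
import random
import zlib

def oov_ngram_vector(ngram, dim=5):
    """Derive a deterministic pseudo-random vector for an unseen ngram.

    zlib.crc32 is a stable (unsalted) hash, so the same ngram yields the
    same seed, and hence the same vector, across runs and machines.
    Bounds mirror the usual uniform(-1/dim, 1/dim) embedding init.
    """
    seed = zlib.crc32(ngram.encode("utf-8"))
    rng = random.Random(seed)
    bound = 1.0 / dim
    return [rng.uniform(-bound, bound) for _ in range(dim)]

v1 = oov_ngram_vector("<xyz")
v2 = oov_ngram_vector("<xyz")  # identical to v1, by construction
```

As the comment notes, the caveat from the discussion applies: once incremental training makes such an ngram "IV", its vector must be materialized and updated, so the on-the-fly scheme only works cleanly for ngrams that stay out of vocabulary.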

@menshikh-iv
Contributor

Great, we've wrapped everything up. Nice job @mpenkov 🔥
For #2313 (comment) we have a separate issue, #2329. Time to merge this one once AppVeyor passes. I love AppVeyor (sarcasm).
