[Feature request] Load full native fastText model to continue training on new data #2160

Closed
tranhungnghiep opened this issue Aug 24, 2018 · 5 comments · Fixed by #2313
Labels: bug, difficulty medium, fasttext

tranhungnghiep commented Aug 24, 2018

Currently, gensim cannot load a native fastText model and continue training it. According to the docs [1], this is because it loads only the input-hidden matrix. However, fastText also saves the hidden-output matrix [2].

Moreover, even the input-hidden matrix alone could support some form of transfer learning, with the hidden-output matrix initialized randomly, similar to how gensim.models.Word2Vec.intersect_word2vec_format() works.

Please correct me if I'm wrong, but I don't think there is any technical obstacle to loading a fastText model and continuing to train it. Could this feature be supported?
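The transfer-learning idea above can be sketched without gensim. This is a minimal illustration, not gensim's implementation: `init_from_pretrained`, `pretrained`, and `new_vocab` are hypothetical names, and the random initialization range is an assumption.

```python
import random


def init_from_pretrained(pretrained, new_vocab, dim, seed=0):
    """Build input-hidden and hidden-output matrices for new_vocab.

    Hypothetical helper: input rows are copied from `pretrained` when the
    word is known; the hidden-output matrix is initialized randomly, as
    proposed in the comment above.
    """
    rng = random.Random(seed)

    def random_row():
        # Small uniform values; the range is an assumption for this sketch.
        return [(rng.random() - 0.5) / dim for _ in range(dim)]

    input_hidden = []
    reused = 0
    for word in new_vocab:
        vec = pretrained.get(word)
        if vec is not None and len(vec) == dim:
            input_hidden.append(list(vec))  # reuse the pretrained vector
            reused += 1
        else:
            input_hidden.append(random_row())  # unseen word: fresh vector
    hidden_output = [random_row() for _ in new_vocab]
    return input_hidden, hidden_output, reused


pretrained = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
new_vocab = ["cat", "fish", "dog"]
inp, out, reused = init_from_pretrained(pretrained, new_vocab, dim=3)
print(reused)  # number of rows copied from the pretrained vectors
```

Only the rows for words shared between the two vocabularies carry over; everything else starts from scratch, which is why loading the real hidden-output matrix would be preferable when it is available.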

Contributor

menshikh-iv commented Aug 27, 2018

@tranhungnghiep thanks for the request. As I remember, FB distributes 2 types of models:

  • vectors only: a .vec file in plain text format (i.e. no ngrams, only 1 matrix for words); to load it, use KeyedVectors.load_word2vec_format
  • full model: a binary .bin file; FastText.load_fasttext_format should be used to load the ngrams and continue the training process
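To make the difference between the two files concrete, here is a minimal sketch of what the plain-text .vec format contains: a header line with the vector count and dimension, then one word and its values per line. `read_vec` is a hypothetical helper written for illustration; in practice KeyedVectors.load_word2vec_format does this job.

```python
import io


def read_vec(stream):
    """Parse word2vec text format: a 'count dim' header, then one word per line."""
    count, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError("expected %d vectors, found %d" % (count, len(vectors)))
    return vectors


# Tiny in-memory stand-in for a .vec file.
sample = io.StringIO(u"2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vecs = read_vec(sample)
print(sorted(vecs))
```

Nothing in this file describes ngram vectors or the hidden-output matrix, so a .vec file alone cannot be used to resume training.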

I think this is a bug in the current implementation (this should already work):

from gensim.models import FastText
from gensim.test.utils import common_texts


m = FastText.load_fasttext_format("wiki.ru.bin")  # load wiki FB model from https://fasttext.cc/docs/en/pretrained-vectors.html
m.build_vocab(common_texts, update=True)  # this doesn't work, but should. See also https://github.com/RaRe-Technologies/gensim/issues/2139
"""
/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in build_vocab(self, sentences, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    480         return super(FastText, self).build_vocab(
    481             sentences, update=update, progress_per=progress_per,
--> 482             keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
    483 
    484     def _set_train_params(self, **kwargs):

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/base_any2vec.pyc in build_vocab(self, sentences, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    805             trim_rule=trim_rule, **kwargs)
    806         report_values['memory'] = self.estimate_memory(vocab_size=report_values['num_retained_words'])
--> 807         self.trainables.prepare_weights(self.hs, self.negative, self.wv, update=update, vocabulary=self.vocabulary)
    808 
    809     def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in prepare_weights(self, hs, negative, wv, update, vocabulary)
    932 
    933     def prepare_weights(self, hs, negative, wv, update=False, vocabulary=None):
--> 934         super(FastTextTrainables, self).prepare_weights(hs, negative, wv, update=update, vocabulary=vocabulary)
    935         self.init_ngrams_weights(wv, update=update, vocabulary=vocabulary)
    936 

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/word2vec.pyc in prepare_weights(self, hs, negative, wv, update, vocabulary)
   1744             self.reset_weights(hs, negative, wv)
   1745         else:
-> 1746             self.update_weights(hs, negative, wv)
   1747 
   1748     def seeded_vector(self, seed_string, vector_size):

/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/word2vec.pyc in update_weights(self, hs, negative, wv)
   1791             self.syn1 = vstack([self.syn1, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
   1792         if negative:
-> 1793             self.syn1neg = vstack([self.syn1neg, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
   1794         wv.vectors_norm = None
   1795 

AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
"""

m.train(common_texts, epochs=1, total_examples=len(common_texts))
"""
Exception in thread Thread-17:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 164, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
  File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/fasttext.py", line 555, in _do_train_job
    tally += train_batch_sg(self, sentences, alpha, work, neu1)
  File "gensim/models/fasttext_inner.pyx", line 276, in gensim.models.fasttext_inner.train_batch_sg
    cdef REAL_t *word_locks_vocab = <REAL_t *>(np.PyArray_DATA(model.trainables.vectors_vocab_lockf))
AttributeError: 'FastTextTrainables' object has no attribute 'vectors_vocab_lockf'
"""

Of course, I'm +1 for fixing this issue, so that training works as @tranhungnghiep suggests.
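The first traceback fails because update_weights stacks new rows onto a syn1neg attribute that the loader never created. The sketch below shows one defensive way to handle that case, using plain lists instead of NumPy arrays. `ensure_output_layer` and `FakeTrainables` are hypothetical names, and this is not the fix that was merged into gensim.

```python
def ensure_output_layer(trainables, vocab_size, layer1_size):
    """Make sure trainables.syn1neg has one row per vocabulary word.

    Hypothetical workaround: creates the matrix if the loader did not set
    it, and pads it with zero rows when the vocabulary has grown.
    """
    existing = getattr(trainables, "syn1neg", None)
    rows = [] if existing is None else [list(row) for row in existing]
    missing = vocab_size - len(rows)
    rows.extend([0.0] * layer1_size for _ in range(missing))
    trainables.syn1neg = rows
    return rows


class FakeTrainables(object):
    """Stand-in for an object whose loader did not set syn1neg."""


t = FakeTrainables()
ensure_output_layer(t, vocab_size=3, layer1_size=4)  # matrix created
ensure_output_layer(t, vocab_size=5, layer1_size=4)  # two rows appended
print(len(t.syn1neg))
```

Zero rows avoid the AttributeError but discard whatever the native model learned in its output layer, so this only hides the symptom; loading the real matrix is the actual fix.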

Related issue - #2139

menshikh-iv added the bug and difficulty medium labels Aug 27, 2018
Author

tranhungnghiep commented Aug 27, 2018

@menshikh-iv Thanks for looking into it.

This issue is a lower-level problem: specifically, FastText.load_fasttext_format() currently does not load the hidden-output matrix. After loading it, we may need to do some checks and initializations related to #2139.
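Loading the second matrix amounts to reading one more block from the binary stream instead of stopping after the first. The sketch below reads two consecutive dense matrices from an in-memory buffer. The block layout used here (two 64-bit integers for the shape, then 32-bit floats) is an assumption for illustration, and a real .bin file also contains a header, the dictionary, and other fields that this sketch does not handle.

```python
import io
import struct


def read_dense_matrix(stream):
    """Read one matrix stored as int64 rows, int64 cols, then float32 values."""
    rows, cols = struct.unpack("<qq", stream.read(16))
    flat = struct.unpack("<%df" % (rows * cols), stream.read(4 * rows * cols))
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


def write_dense_matrix(stream, matrix):
    """Write a matrix in the same assumed layout (used to build test data)."""
    rows, cols = len(matrix), len(matrix[0])
    stream.write(struct.pack("<qq", rows, cols))
    for row in matrix:
        stream.write(struct.pack("<%df" % cols, *row))


buf = io.BytesIO()
write_dense_matrix(buf, [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # "input-hidden"
write_dense_matrix(buf, [[0.5, 0.25], [0.0, 1.0]])  # "hidden-output"
buf.seek(0)

input_hidden = read_dense_matrix(buf)
hidden_output = read_dense_matrix(buf)  # the block the loader currently skips
print(len(input_hidden), len(hidden_output))
```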


aviclu commented Sep 30, 2020

Hi @menshikh-iv, it seems that the hidden vectors are still bad. I'm using the gensim.models.fasttext.load_facebook_model function to load the .bin file, but syn1 fails to load. Also, trainables.syn1neg is full of zeros.

@menshikh-iv
Contributor

Hi @aviclu, please post more information:

  • reproducible code example
  • model file
  • stacktrace
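A reproducible example for the "full of zeros" report could start from a small check like the one below. `zero_fraction` is a hypothetical helper, and the commented gensim lines use a placeholder path together with the function and attribute names mentioned in the comment above.

```python
def zero_fraction(matrix):
    """Return the fraction of rows whose entries are all exactly zero."""
    rows = [list(row) for row in matrix]
    if not rows:
        return 0.0
    zero_rows = sum(1 for row in rows if all(value == 0 for value in row))
    return zero_rows / float(len(rows))


demo = [[0.0, 0.0], [0.1, -0.2], [0.0, 0.0], [0.0, 0.0]]
print(zero_fraction(demo))

# With gensim installed, the same check could be applied to a loaded model:
#   from gensim.models.fasttext import load_facebook_model
#   model = load_facebook_model("model.bin")  # placeholder path
#   print(zero_fraction(model.trainables.syn1neg))
```

A result of 1.0 on a trained model would support the claim that the hidden-output matrix was not loaded.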

Collaborator

mpenkov commented Oct 1, 2020

@aviclu Please open a new ticket and be sure to fill in the template.

Repository owner locked as resolved and limited conversation to collaborators Oct 1, 2020