FastText save & callbacks suspicious behavior #2235

darentsia · 2018-10-18T07:49:16Z

Description

The FastText model does not learn anything from the text corpus.

Steps/Code/Corpus to Reproduce

import os
import logging

from gensim.models import FastText
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch and show training parameters '''

    def __init__(self, savedir):
        self.savedir = savedir
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch))
        model.save(savepath)
        print(
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch ... ", sep="\n"
            )
        if os.path.isfile(os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch - 1))):
            print("Previous model deleted ")
            os.remove(os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch - 1)))
        self.epoch += 1

class SentenceIter:
    def __iter__(self):
        with open("data/eng_tweets/20_news_groups_dataset.txt", "r") as f:
            for line in f:
                yield line[:-1].split(" ")

if __name__ == "__main__":

    logging.basicConfig(
        format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO
    )

    num_workers = os.cpu_count()
    # Parameter names (size, iter) follow the gensim 3.6 API.
    model = FastText(
        SentenceIter(),
        sg=1,
        size=100,
        window=3,
        min_count=5,
        workers=num_workers,
        iter=5,
        negative=20,
        callbacks=[EpochSaver("./checkpoints/fasttext_eng_tweets")]
    )

Expected Results

I expect model.most_similar("word") to return words that are close in meaning, but it returns only garbage.
I used an open-source dataset from sklearn.datasets: fetch_20newsgroups.

Actual Results

The results change only very slightly from epoch to epoch: the order of the returned words or their similarity scores may shift a little, but nothing really changes during training. The model does not learn.

Also, what is important:

1. If I train a fastText model from the command line, using this command:
./fasttext skipgram -input data.txt -output model (https:/facebookresearch/fastText)
it gives good results. For example, for apple we get apples, apple's, and so on.
2. If I switch my model from FastText to Word2Vec, it learns and the results are good.
3. If I don't use my EpochSaver, but instead save and load the model manually on each epoch, for example:

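# N_epochs and savepath are placeholders.
for epoch in range(N_epochs):
    # train for one epoch, then save
    model.train(SentenceIter(), total_examples=model.corpus_count, epochs=1)
    model.save(savepath)
    model = FastText.load(savepath)  # reload before the next epoch starts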

and then load the model again before the next epoch starts, I also get good results.

So the problem may be in EpochSaver, but can you please explain why it works with Word2Vec but not here?

Versions

Linux-4.15.0-24-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1


bunyamink · 2018-11-23T09:35:33Z

Same thing for me. When I trained on my wiki corpus with word2vec, I got 37% on the analogy questions. But when I trained on the same corpus with fasttext, the result was 3.3% on the same analogy questions. Is there a problem in fasttext?

Gensim version: 3.6.0
Python Version: 3.6.4
Windows 10

menshikh-iv · 2018-12-14T07:20:29Z

Thanks for the report, @daridar. Point (3) in particular makes me think that we have an issue with the save method (i.e. saving somehow changes the current model).

menshikh-iv · 2018-12-14T07:20:42Z

CC @mpenkov

5cat · 2019-01-17T03:31:21Z

I have encountered the same problem while trying to train a FastText model on a big dataset.

Here is a simplified version of the problem.

from gensim.models.fasttext import FastText
from gensim.models.word2vec import Word2Vec 
import gensim.downloader as api
import numpy as np
from tqdm import tqdm
from time import sleep
class list_iter:
	def __init__(self,array,model,see=np.nan,only_one_loop=False):
		self.array=array
		self.see=see
		self.model=model
		self.only_one_loop=only_one_loop
		self.tqdm_bar=tqdm(desc='iterations')
	def __iter__(self):
		while True:
			for item in self.array:
				self.tqdm_bar.update(1)
				
				if self.tqdm_bar.n%self.see==0:
					print('\nvector hash:'+str(hash(self.model['I'].tostring())))
					sleep(2)
					self.model.wv.save("model")
				yield item
			if self.only_one_loop:
				self.tqdm_bar.close()
				break


with open("tinyshakespeare.txt", 'r') as fp:
	corpus=[i.split() for i in fp.read().split('\n')]

model=FastText(workers=1)

model.build_vocab(list_iter(corpus,model,only_one_loop=True))

model.train(list_iter(corpus,model,see=10000),total_examples=99999999999999999,epochs=10)

The output of running this code is:

iterations: 40001it [00:00, 359624.54it/s]
iterations: 8178it [00:00, 81174.89it/s]
vector hash:-4933655588363529352
iterations: 17619it [00:02, 5406.74it/s]
vector hash:-4933655588363529352
iterations: 28266it [00:04, 4400.15it/s]
vector hash:-4933655588363529352
iterations: 39611it [00:06, 4396.16it/s]
vector hash:-4933655588363529352

The vector of the word is not changing and the model is not learning anything.
If I replace FastText(workers=1) with Word2Vec(workers=1), everything works and makes sense, and the vector is updated:

iterations: 40001it [00:00, 359474.28it/s]
iterations: 0it [00:00, ?it/s]
vector hash:-3094244126925185959
iterations: 19618it [00:02, 6651.19it/s]
vector hash:1153644772814581057
iterations: 22603it [00:04, 3228.06it/s]
vector hash:5947032563406220642
iterations: 30001it [00:06, 3326.54it/s]
vector hash:-7484819002721531784

By the way, you can use any text file.
I also think the problem is not in the save method, because even without saving, the vector is the same after each iteration. When I check the hash of the saved file, it is different each time I save during training, but for some reason I can't see any changes in the vectors.
Even when I went back to gensim 3.1.0, the issue was still there.
Why is that?
gensim==3.6.0
python==3.6.4
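
The check used in the comment above (hashing one word's vector at intervals to see whether training touches it) can be sketched without gensim. This is a minimal illustration, not code from the thread: `vector_fingerprint` and `sgd_step` are hypothetical helper names, and the update rule is a stand-in for a real training step. It uses `hashlib` rather than the built-in `hash()`, because Python 3 salts `hash()` of bytes per process, so built-in hash values are only comparable within a single run.

```python
import hashlib
import struct


def vector_fingerprint(vector):
    """Return a short digest of a sequence of floats that is stable across runs."""
    # Pack the floats as little-endian doubles so equal values give equal bytes.
    packed = struct.pack("<%dd" % len(vector), *vector)
    return hashlib.sha256(packed).hexdigest()[:16]


def sgd_step(vector, gradient, lr=0.1):
    """Hypothetical stand-in for one training update (not gensim code)."""
    return [v - lr * g for v, g in zip(vector, gradient)]


before = [0.1, -0.2, 0.3]
after = sgd_step(before, [1.0, 0.0, -1.0])

# An untouched copy keeps the same fingerprint; an updated vector does not.
unchanged = vector_fingerprint(before) == vector_fingerprint(list(before))
changed = vector_fingerprint(before) != vector_fingerprint(after)
print(unchanged, changed)
```

If the fingerprint of a frequent word stays constant across many training batches, as in the FastText output above, the vector being read is not the one being updated.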

5cat · 2019-01-17T04:24:13Z

I think I have fixed my problem. It looks like the gensim team is working on a fix, but it has not been released in the pip version yet?!
This is what I did:
pip3 uninstall gensim
Then reinstall it from this commit:
pip3 install 'git+git:/RaRe-Technologies/gensim.git@b452a5b59f2f474dbbd275d0838c45df4d3c5aac'
Then, before I save the model, I call this function:

self.model.wv.adjust_vectors()
self.model.wv.save("model")

This solution is for my case. If you let the train function run to completion, there is no need to call model.wv.adjust_vectors() yourself, since train calls it at the end:

        super(FastText, self).train(
            sentences=sentences, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
            epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
            queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks)
        self.wv.adjust_vectors()
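
Why a mid-training save or lookup can show a frozen vector is easier to see with a toy model. The sketch below is not gensim's implementation. It assumes that the vector returned for a word is a cache derived from separately trained whole-word and n-gram vectors, and that only an explicit `adjust_vectors()` call refreshes that cache. `ToyFastTextVectors`, `train_step` and the averaging rule are illustrative assumptions.

```python
class ToyFastTextVectors:
    """Toy stand-in (NOT gensim's class): a word vector cached from trainable parts."""

    def __init__(self, word_part, ngram_parts):
        self.word_part = list(word_part)                   # trainable whole-word vector
        self.ngram_parts = [list(p) for p in ngram_parts]  # trainable n-gram vectors
        self.cached = [0.0] * len(self.word_part)          # what lookups and saves would see
        self.adjust_vectors()

    def train_step(self, delta):
        """Update only the trainable parts; the cache is deliberately left alone."""
        self.word_part = [x + delta for x in self.word_part]
        self.ngram_parts = [[x + delta for x in p] for p in self.ngram_parts]

    def adjust_vectors(self):
        """Recompute the cache as the mean of the word part and its n-gram parts."""
        parts = [self.word_part] + self.ngram_parts
        self.cached = [sum(column) / len(parts) for column in zip(*parts)]


toy = ToyFastTextVectors([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]])
before = list(toy.cached)

toy.train_step(0.5)
stale = list(toy.cached)   # still the pre-training values

toy.adjust_vectors()
fresh = list(toy.cached)   # now reflects the training step
```

Under these assumptions, anything that reads or saves the cached vectors between `train_step` and `adjust_vectors` sees the old values, which matches the constant hash reported earlier in this thread and explains why calling `adjust_vectors()` before `save` helped.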

menshikh-iv · 2019-01-17T06:48:25Z

> I think that I have fixed my problem, it looks like gensim team is working on solving it but its not released yet in the pip version?!

Yes, exactly. Big thanks to @mpenkov, who helped us a lot with fasttext-related issues in #2313.

menshikh-iv · 2019-01-17T06:51:44Z

I guess I can close this issue as fixed by #2313

CC: @mpenkov

menshikh-iv changed the title ~~FastText does not learn anything~~ FastText save & callbacks suspicious behavior Dec 14, 2018

menshikh-iv added the bug (Issue described a bug) and difficulty medium (Medium issue: required good gensim understanding & python skills) labels Dec 14, 2018

menshikh-iv assigned mpenkov Dec 14, 2018

mpenkov added the fasttext (Issues related to the FastText model) label Dec 15, 2018

mpenkov mentioned this issue Jan 15, 2019

FastText segfaults during training of skipgram model #2333

Open

menshikh-iv closed this as completed Jan 17, 2019
