FastText segfaults during training of skipgram model #2333

Open
mpenkov opened this issue Jan 15, 2019 · 4 comments
Labels: bug, difficulty medium, fasttext

Comments

@mpenkov
Collaborator

mpenkov commented Jan 15, 2019

Given a data file:

Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!destroyer!cs.ubc.ca!mala.bc.ca!oneb!jc
Newsgroups: sci.med
Subject: Broken rib
Message-ID: <[email protected]>
From: [email protected]
Date: Tue, 20 Apr 93 17:52:00 PDT
Organization: The Old Frog's Almanac, Nanaimo, B.C.
Keywords: advice needed
Summary: long term problems?
Lines: 17

Hello,  I am not sure if this is the right conference to ask this
question, however, Here I go..  I am a commercial fisherman and I 
fell about 3 weeks ago down into the hold of the boat and broke or
cracked a rib and wrenched and bruised my back and left arm.
  My question,  I have been to a doctor and was told that it was 
best to do nothing and it would heal up with no long term effect, and 
indeed I am about 60 % better, however, the work I do is very 
hard and I am still not able to go back to work.  The thing that worries me
is the movement or "clunking" I feel and hear back there when I move

the following code segfaults when the iterator yields words_bad (as posted, it yields words_good, which runs without error):

import codecs
import logging
import os
import os.path as P

from gensim.models import FastText

class SentenceIter:
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        curr_dir = P.dirname(P.abspath(__file__))
        with codecs.open(P.join(curr_dir, self.filename), "r", 'utf-8', errors='replace') as fin:
            for line in fin:
                words_bad = line[:-1].split(" ")  # Yielding this causes a segfault
                words_good = line.rstrip().split(' ')  # Yielding this is OK
                logging.debug('words_bad: %r', words_bad)
                logging.debug('words_good: %r', words_good)
                logging.debug('---')
                yield words_good


def main():
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    num_workers = os.cpu_count()
    model = FastText(
        SentenceIter("data-example.txt"),  # This works when the iterator yields words_good
        sg=1,
        size=100,
        window=3,
        min_count=5,
        workers=num_workers,
        iter=5,
        negative=20,
    )


if __name__ == "__main__":
    main()

The result is the same for the develop branch and the 3.6.0 release.

Setting sg=0 in the constructor avoids the segfault, so the problem could be skipgram-related.
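
To make the difference between the two tokenizations concrete, here is a minimal sketch of what each expression returns for a few inputs. The helper names `tokenize_slice` and `tokenize_rstrip` are hypothetical and only mirror the `words_bad` and `words_good` expressions from the script above. This shows how the outputs differ; it is not a diagnosis of the crash.

```python
def tokenize_slice(line):
    """Mirror of words_bad: drop the last character, then split on single spaces."""
    return line[:-1].split(" ")


def tokenize_rstrip(line):
    """Mirror of words_good: strip all trailing whitespace, then split on single spaces."""
    return line.rstrip().split(" ")


# A line with a trailing space before the newline, as in the sample data.
trailing = "fisherman and I \n"
print(tokenize_slice(trailing))   # ['fisherman', 'and', 'I', '']
print(tokenize_rstrip(trailing))  # ['fisherman', 'and', 'I']

# A double space produces an empty token with BOTH expressions.
double = "Hello,  I\n"
print(tokenize_slice(double))     # ['Hello,', '', 'I']
print(tokenize_rstrip(double))    # ['Hello,', '', 'I']

# A final line with no newline loses its last character when sliced.
no_newline = "when I move"
print(tokenize_slice(no_newline))   # ['when', 'I', 'mov']
print(tokenize_rstrip(no_newline))  # ['when', 'I', 'move']

# str.split() with no argument never yields empty tokens.
print(double.split())  # ['Hello,', 'I']
```

So the only differences are trailing empty tokens, other trailing whitespace characters, and a truncated last word when the file does not end in a newline. Whether any of these is what triggers the segfault is not established here.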

Running in gdb gives the following output:

[Switching to Thread 0x7fffe7eb5700 (LWP 11302)]
__pyx_f_6gensim_6models_14fasttext_inner_fasttext_fast_sentence_sg_neg (__pyx_v_negative=<optimized out>, __pyx_v_cum_table=0x11b5560, __pyx_v_cum_table_len=5, __pyx_v_syn0_vocab=<optimized out>, 
    __pyx_v_syn0_ngrams=0x17fbfd0, __pyx_v_syn1neg=0x17f7340, __pyx_v_size=<optimized out>, __pyx_v_word_index=4, __pyx_v_word2_index=0, __pyx_v_subwords_index=0x165ea80, __pyx_v_subwords_len=0, 
    __pyx_v_alpha=0.0200200006, __pyx_v_work=0x7fffb0001200, __pyx_v_l1=0x7fffb8001000, __pyx_v_next_random=223366870975522, __pyx_v_word_locks_vocab=0x17f8dc0, __pyx_v_word_locks_ngrams=0x17f95a0)
    at ./gensim/models/fasttext_inner.c:2333
2333        __pyx_v_g = ((__pyx_v_label - __pyx_v_f) * __pyx_v_alpha);
(gdb) p __pyx_v_g
$1 = -1.28037043e+32
(gdb) p __pyx_v_label
$2 = 1
(gdb) p __pyx_v_f
Cannot access memory at address 0x7ffdd5ddd7c0
(gdb) p __pyx_v_alpha
$3 = 0.0200200006
(gdb)
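
For context, the faulting line computes the scaled error term of one negative-sampling update. Below is a rough pure-Python sketch of that single step, assuming the standard word2vec formulation in which `f` is the sigmoid of the dot product between the input vector and one output vector. The function name `sg_neg_error_term` and its arguments are hypothetical, and the compiled code may compute the sigmoid differently (for example via a lookup table).

```python
import math


def sg_neg_error_term(l1, output_row, label, alpha):
    """Return g = (label - f) * alpha for one (input, output) vector pair.

    Hypothetical helper: l1 and output_row are plain lists of floats,
    label is 1 for the true context word and 0 for a negative sample.
    """
    f_dot = sum(a * b for a, b in zip(l1, output_row))
    f = 1.0 / (1.0 + math.exp(-f_dot))  # sigmoid, always strictly between 0 and 1
    return (label - f) * alpha


# With a zero input vector the sigmoid is 0.5, so g = (1 - 0.5) * 0.02.
g = sg_neg_error_term([0.0, 0.0], [1.0, 1.0], label=1, alpha=0.02)
print(g)
```

Under that assumption, with a label of 0 or 1 and `f` between 0 and 1, the magnitude of `g` cannot exceed `alpha`. The value printed by gdb (-1.28037043e+32 with alpha of about 0.02) is far outside that range, which would be consistent with `f` having been read from invalid memory.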
mpenkov added the bug label Jan 15, 2019
mpenkov changed the title from "FastText segfaults during training" to "FastText segfaults during training of skipgram model" Jan 15, 2019
@mpenkov
Collaborator Author

mpenkov commented Jan 15, 2019

Related to issue #2235

menshikh-iv added the difficulty medium, fasttext, and need info labels Jan 15, 2019
@menshikh-iv
Contributor

I could not reproduce this (Python 3.7, gensim==3.6.0). Any ideas on how to reproduce it, @mpenkov?

@mpenkov
Collaborator Author

mpenkov commented Jan 15, 2019

Dockerfile:

FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install python3 python3-pip wget -y
RUN pip3 install gensim==3.6.0

RUN mkdir /app
RUN wget https://gist.githubusercontent.com/mpenkov/3d64e712a7ae1ef157840b98e9c53d22/raw/bec9878e93052cadb07edb47222b1bbf1c746bb6/trigger.py -O /app/trigger.py

RUN wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz -O data.tar.gz
RUN tar xzf data.tar.gz
RUN find 20_newsgroups -type f | xargs cat > /app/data.txt
RUN rm -rf data.tar.gz 20_newsgroups

RUN python3 --version
RUN python3 /app/trigger.py

Save it in the current directory and run:

docker build -t gensim-2333 .

to reproduce the bug. You should see:

Step 11/11 : RUN python3 /app/trigger.py
 ---> Running in 4a8b66f4ab80
2019-01-15 09:29:06,901 : INFO : collecting all words and their counts
2019-01-15 09:29:06,902 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-01-15 09:29:07,009 : INFO : PROGRESS: at sentence #10000, processed 78320 words, keeping 16316 word types
2019-01-15 09:29:07,010 : INFO : collected 16318 word types from a corpus of 78327 raw words and 10001 sentences
2019-01-15 09:29:07,010 : INFO : Loading a fresh vocabulary
2019-01-15 09:29:07,020 : INFO : effective_min_count=5 retains 1684 unique words (10% of original 16318, drops 14634)
2019-01-15 09:29:07,020 : INFO : effective_min_count=5 leaves 57838 word corpus (73% of original 78327, drops 20489)
2019-01-15 09:29:07,024 : INFO : deleting the raw counts dictionary of 16318 items
2019-01-15 09:29:07,025 : INFO : sample=0.001 downsamples 45 most-common words
2019-01-15 09:29:07,025 : INFO : downsampling leaves estimated 35224 word corpus (60.9% of prior 57838)
2019-01-15 09:29:07,046 : INFO : estimated required memory for 1684 words, 17428 buckets and 100 dimensions: 9495136 bytes
2019-01-15 09:29:07,047 : INFO : resetting layer weights
2019-01-15 09:29:07,323 : INFO : Total number of ngrams is 17428
2019-01-15 09:29:07,453 : INFO : training model with 8 workers on 1684 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=20 window=3
Segmentation fault (core dumped)
The command '/bin/sh -c python3 /app/trigger.py' returned a non-zero code: 139
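
The exit code 139 in the last line follows the usual shell convention of reporting 128 plus the signal number for a process killed by a signal. A small sketch of decoding it, using only the standard library (the helper name `describe_shell_status` is hypothetical):

```python
import signal


def describe_shell_status(status):
    """Map a shell exit status above 128 to the name of the terminating signal."""
    if status > 128:
        return signal.Signals(status - 128).name
    return "exit code %d" % status


print(describe_shell_status(139))  # SIGSEGV
```

Running the script as `python3 -X faulthandler /app/trigger.py` should additionally print the Python-level traceback of each thread at the moment of the crash, which may help narrow down where in training it happens.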

@mpenkov
Collaborator Author

mpenkov commented Jan 19, 2019

@menshikh-iv I think the Dockerfile above successfully reproduces the issue, so maybe we should remove the "need info" label from this ticket?

menshikh-iv removed the need info label Jan 19, 2019