FastText segfaults during training of skipgram model #2333

mpenkov · 2019-01-15T00:16:39Z

Given a data file:

Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!destroyer!cs.ubc.ca!mala.bc.ca!oneb!jc
Newsgroups: sci.med
Subject: Broken rib
Message-ID: <[email protected]>
From: [email protected]
Date: Tue, 20 Apr 93 17:52:00 PDT
Organization: The Old Frog's Almanac, Nanaimo, B.C.
Keywords: advice needed
Summary: long term problems?
Lines: 17

Hello,  I am not sure if this is the right conference to ask this
question, however, Here I go..  I am a commercial fisherman and I 
fell about 3 weeks ago down into the hold of the boat and broke or
cracked a rib and wrenched and bruised my back and left arm.
  My question,  I have been to a doctor and was told that it was 
best to do nothing and it would heal up with no long term effect, and 
indeed I am about 60 % better, however, the work I do is very 
hard and I am still not able to go back to work.  The thing that worries me
is the movement or "clunking" I feel and hear back there when I move

the following code segfaults when the iterator yields words_bad instead of words_good:

import codecs
import logging
import os
import os.path as P

from gensim.models import FastText

class SentenceIter:
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        curr_dir = P.dirname(P.abspath(__file__))
        with codecs.open(P.join(curr_dir, self.filename), "r", 'utf-8', errors='replace') as fin:
            for line in fin:
                words_bad = line[:-1].split(" ")  # Yielding this causes a segfault
                words_good = line.rstrip().split(' ')  # Yielding this is OK
                logging.debug('words_bad: %r', words_bad)
                logging.debug('words_good: %r', words_good)
                logging.debug('---')
                yield words_good


def main():
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    num_workers = os.cpu_count()
    model = FastText(
        SentenceIter("data-example.txt"),  # This works when the iterator yields words_good
        sg=1,
        size=100,
        window=3,
        min_count=5,
        workers=num_workers,
        iter=5,
        negative=20,
    )


if __name__ == "__main__":
    main()

The result is the same for the develop branch and the 3.6.0 release.

Setting sg=0 in the constructor avoids the segfault, so the problem could be skipgram-related.
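For reference, here is a minimal sketch of how the two tokenizations in the script can differ on lines like those in the data file. The helper names `tokenize_bad` and `tokenize_good` are hypothetical and only mirror the `words_bad` and `words_good` expressions above. The sketch shows the differences in the token lists; it does not establish which difference, if any, triggers the crash.

```python
def tokenize_bad(line):
    # Mirrors words_bad: drop the last character, then split on single spaces.
    return line[:-1].split(" ")


def tokenize_good(line):
    # Mirrors words_good: strip trailing whitespace, then split on single spaces.
    return line.rstrip().split(' ')


samples = [
    "fisherman and I \n",   # trailing space before the newline
    "Here I go..  I am\n",  # double space inside the line
    "when I move",          # last line of a file, no trailing newline
]

for line in samples:
    print(repr(line))
    print("  bad: ", tokenize_bad(line))
    print("  good:", tokenize_good(line))
```

A trailing space leaves an extra empty-string token at the end of the "bad" list only, a double space produces an empty-string token in both lists, and a final line without a newline loses its last character in the "bad" variant.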

Running in gdb gives the following output:

[Switching to Thread 0x7fffe7eb5700 (LWP 11302)]
__pyx_f_6gensim_6models_14fasttext_inner_fasttext_fast_sentence_sg_neg (__pyx_v_negative=<optimized out>, __pyx_v_cum_table=0x11b5560, __pyx_v_cum_table_len=5, __pyx_v_syn0_vocab=<optimized out>, 
    __pyx_v_syn0_ngrams=0x17fbfd0, __pyx_v_syn1neg=0x17f7340, __pyx_v_size=<optimized out>, __pyx_v_word_index=4, __pyx_v_word2_index=0, __pyx_v_subwords_index=0x165ea80, __pyx_v_subwords_len=0, 
    __pyx_v_alpha=0.0200200006, __pyx_v_work=0x7fffb0001200, __pyx_v_l1=0x7fffb8001000, __pyx_v_next_random=223366870975522, __pyx_v_word_locks_vocab=0x17f8dc0, __pyx_v_word_locks_ngrams=0x17f95a0)
    at ./gensim/models/fasttext_inner.c:2333
2333        __pyx_v_g = ((__pyx_v_label - __pyx_v_f) * __pyx_v_alpha);
(gdb) p __pyx_v_g
$1 = -1.28037043e+32
(gdb) p __pyx_v_label
$2 = 1
(gdb) p __pyx_v_f
Cannot access memory at address 0x7ffdd5ddd7c0
(gdb) p __pyx_v_alpha
$3 = 0.0200200006
(gdb)


mpenkov · 2019-01-15T02:08:14Z

Related to issue #2235

menshikh-iv · 2019-01-15T04:34:57Z

I could not reproduce this (Python 3.7, gensim==3.6.0). Hm, any ideas how to reproduce it, @mpenkov?

mpenkov · 2019-01-15T08:37:17Z

Dockerfile:

FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install python3 python3-pip wget -y
RUN pip3 install gensim==3.6.0

RUN mkdir /app
RUN wget https://gist.githubusercontent.com/mpenkov/3d64e712a7ae1ef157840b98e9c53d22/raw/bec9878e93052cadb07edb47222b1bbf1c746bb6/trigger.py -O /app/trigger.py

RUN wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz -O data.tar.gz
RUN tar xzf data.tar.gz
RUN find 20_newsgroups -type f | xargs cat > /app/data.txt
RUN rm -rf data.tar.gz 20_newsgroups

RUN python3 --version
RUN python3 /app/trigger.py

Save it in the current directory and run:

docker build -t gensim-2333 .

to reproduce the bug. You should see:

Step 11/11 : RUN python3 /app/trigger.py
 ---> Running in 4a8b66f4ab80
2019-01-15 09:29:06,901 : INFO : collecting all words and their counts
2019-01-15 09:29:06,902 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-01-15 09:29:07,009 : INFO : PROGRESS: at sentence #10000, processed 78320 words, keeping 16316 word types
2019-01-15 09:29:07,010 : INFO : collected 16318 word types from a corpus of 78327 raw words and 10001 sentences
2019-01-15 09:29:07,010 : INFO : Loading a fresh vocabulary
2019-01-15 09:29:07,020 : INFO : effective_min_count=5 retains 1684 unique words (10% of original 16318, drops 14634)
2019-01-15 09:29:07,020 : INFO : effective_min_count=5 leaves 57838 word corpus (73% of original 78327, drops 20489)
2019-01-15 09:29:07,024 : INFO : deleting the raw counts dictionary of 16318 items
2019-01-15 09:29:07,025 : INFO : sample=0.001 downsamples 45 most-common words
2019-01-15 09:29:07,025 : INFO : downsampling leaves estimated 35224 word corpus (60.9% of prior 57838)
2019-01-15 09:29:07,046 : INFO : estimated required memory for 1684 words, 17428 buckets and 100 dimensions: 9495136 bytes
2019-01-15 09:29:07,047 : INFO : resetting layer weights
2019-01-15 09:29:07,323 : INFO : Total number of ngrams is 17428
2019-01-15 09:29:07,453 : INFO : training model with 8 workers on 1684 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=20 window=3
Segmentation fault (core dumped)
The command '/bin/sh -c python3 /app/trigger.py' returned a non-zero code: 139
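As a side note on the exit code in the last line: shells conventionally report a process killed by signal N as exit status 128 + N, so 139 corresponds to signal 11, which is SIGSEGV on Linux. A small sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical.

```python
import signal


def describe_exit_code(code):
    """Return the signal name if `code` follows the 128 + N shell convention, else None."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # Not a known signal number on this platform.
            return None
    return None


print(describe_exit_code(139))
```

This matches the "Segmentation fault (core dumped)" message printed just before the build step fails.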

mpenkov · 2019-01-19T10:04:43Z

@menshikh-iv I think the Dockerfile above successfully reproduces the issue, so maybe we should remove the "need info" label from this ticket?

mpenkov added the bug Issue described a bug label Jan 15, 2019

mpenkov changed the title ~~FastText segfaults during training~~ FastText segfaults during training of skipgram model Jan 15, 2019

menshikh-iv added difficulty medium Medium issue: required good gensim understanding & python skills fasttext Issues related to the FastText model need info Not enough information for reproduce an issue, need more info from author labels Jan 15, 2019

menshikh-iv removed the need info Not enough information for reproduce an issue, need more info from author label Jan 19, 2019
