
Can word vectors have an impact on Textcat? #4009

Closed
romlatron opened this issue Jul 23, 2019 · 10 comments · Fixed by #5004
Labels: bug (Bugs and behaviour differing from documentation) · feat / textcat (Feature: Text Classifier) · feat / vectors (Feature: Word vectors and similarity)

Comments

@romlatron

I have a model with NER and Textcat components, using custom word vectors.
While the impact of the vectors is clear on the NER, the Textcat behaves the same whether the model is loaded with or without the vocabulary (and thus the vectors).
Is there a way to make the most of the word vectors to improve my textcat component?

Which page or section is this issue related to?

https://spacy.io/api/textcategorizer

@honnibal added the bug label Jul 23, 2019
@honnibal
Member

Thanks for the report, this does look like a bug. I should have a workaround for you shortly.

@honnibal
Member

The root cause is that the mechanism by which spaCy decides to use the pretrained vectors is pretty messy, which has led to a number of bugs.

In the nlp.begin_training() method, we're checking whether vectors are loaded, and then adding the pretrained_vectors setting into the components' config for when we call the components' .begin_training() methods.

In many examples, we call textcat.begin_training() directly instead of using nlp.begin_training(). There's no equivalent logic within textcat.begin_training(), so the vectors aren't picked up that way.

The workaround is to add the keyword argument to the call: textcat.begin_training(pretrained_vectors=nlp.vocab.vectors.name)
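
A minimal sketch of that workaround in context (the model name and label below are hypothetical; it assumes spaCy v2 and a loaded pipeline whose vocab already contains the custom vectors):

    import spacy

    # Hypothetical model package whose vocab ships the custom vectors
    nlp = spacy.load("my_model_with_vectors")
    textcat = nlp.create_pipe("textcat")
    textcat.add_label("POSITIVE")  # hypothetical label
    nlp.add_pipe(textcat, last=True)
    # Pass the vectors' name explicitly so the textcat model picks them up
    optimizer = textcat.begin_training(pretrained_vectors=nlp.vocab.vectors.name)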

This bug should be resolved by making sure the components' .begin_training() methods infer the value for pretrained_vectors if it's missing. If the keyword argument is passed explicitly as pretrained_vectors=False or pretrained_vectors=None, we should avoid using the pretrained vectors.
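
Roughly, the intended inference could be sketched like this (illustrative pseudocode, not the actual spaCy source; cfg stands for the component's config dict and vocab for the shared vocab):

    # Illustrative sketch of the proposed inference, not actual spaCy source
    if "pretrained_vectors" not in cfg:
        # Keyword missing: infer the setting from the shared vocab
        if len(vocab.vectors) > 0:
            cfg["pretrained_vectors"] = vocab.vectors.name
    elif cfg["pretrained_vectors"] in (False, None):
        # Explicit False/None opts out of the pretrained vectors
        cfg["pretrained_vectors"] = None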

@honnibal added the feat / textcat and feat / vectors labels Jul 23, 2019
@romlatron
Author

I still can't get it to work: after calling nlp.begin_training(pretrained_vectors=nlp.vocab.vectors.name) and then nlp.update(), the component still doesn't depend on the model's vocabulary when I call it.
I originally use Prodigy for model training, so I'm a bit unclear on how it all works here. Would it be better to call textcat.begin_training() instead?

@romlatron
Author

Hi,
It still doesn't work for me using nlp.begin_training(pretrained_vectors=nlp.vocab.vectors.name). Is there something I might be forgetting?

@BreakBB
Contributor

BreakBB commented Aug 1, 2019

The pretrained_vectors parameter should only be necessary when calling textcat.begin_training, not nlp.begin_training.

Could you add some more code to show how you're loading your model and trying to train it?

@romlatron
Author

Thanks for the answer, here is the training code, adapted from the example on spaCy's website:

    import random

    import spacy
    from spacy.pipeline import TextCategorizer
    from spacy.util import compounding, minibatch

    # `model`, `label`, `n_iter` and `load_data` are defined elsewhere in
    # the script, as in spaCy's text classification example
    nlp = spacy.load(model)
    textcat = TextCategorizer(nlp.vocab)
    textcat.add_label(label)
    nlp.pipeline.append(('textcat', textcat))
    (train_texts, train_cats), (dev_texts, dev_cats) = load_data(label)

    train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training(pretrained_vectors=nlp.vocab.vectors.name)
        batch_sizes = compounding(4.0, 32.0, 1.001)
        for i in range(n_iter):
            losses = {}
            # batch up the examples using spaCy's minibatch
            random.shuffle(train_data)
            batches = minibatch(train_data, size=batch_sizes)
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, losses=losses)

I tried calling nlp.begin_training with no arguments, but that doesn't work either.
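
For reference, a sketch of how honnibal's workaround above might be applied to this script (untested against this exact setup; it would replace the nlp.begin_training() call):

    # Sketch: initialize the textcat component directly so it sees the
    # vectors, and reuse the returned optimizer for nlp.update()
    optimizer = textcat.begin_training(pretrained_vectors=nlp.vocab.vectors.name)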

@romlatron
Author

Hi, could I get an update on this? Maybe just a short code example of how it's supposed to work?

@svlandeg
Member

svlandeg commented Oct 9, 2019

@honnibal: could this be due to the method build_text_classifier depending on the config parameter pretrained_dims, which is set to 0 by default?

[EDIT] (GH is being annoying): should we remove that parameter altogether and only rely on pretrained_vectors?

@svlandeg
Member

@romlatron: apologies for the late follow-up, this will be fixed in the next version, cf. #5004.


lock bot commented Mar 17, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020