
Can word vectors have an impact on Textcat? #4009

Closed
romlatron opened this issue Jul 23, 2019 · 10 comments · Fixed by #5004
Labels: bug (Bugs and behaviour differing from documentation) · feat / textcat (Feature: Text Classifier) · feat / vectors (Feature: Word vectors and similarity)

Comments

@romlatron

I have a model with NER and Textcat components, using custom word vectors.
While the impact of the vectors is clear on the NER, the Textcat behaves the same whether the model is loaded with or without the vocabulary (and thus the vectors).
Is there a way to make the most of the word vectors to improve my textcat component?

Which page or section is this issue related to?

https://spacy.io/api/textcategorizer

@honnibal added the bug label Jul 23, 2019
@honnibal
Member

Thanks for the report, this does look like a bug. I should have a workaround for you shortly.

@honnibal
Member

The root cause is that the mechanism by which spaCy decides to use the pretrained vectors is pretty messy, which has led to a number of bugs.

In the nlp.begin_training() method, we're checking whether vectors are loaded, and then adding the pretrained_vectors setting into the components' config for when we call the components' .begin_training() methods.

In many examples, we call textcat.begin_training() directly instead of using nlp.begin_training(). There's no equivalent logic within textcat.begin_training(), so the vectors aren't picked up that way.

The workaround is to add the keyword argument to the call: textcat.begin_training(pretrained_vectors=nlp.vocab.vectors.name)
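
A minimal sketch of that workaround in context (the model name and label below are hypothetical; it assumes spaCy v2 and a loaded pipeline whose vocab already contains the custom vectors):

    import spacy

    # Hypothetical model package whose vocab ships the custom vectors
    nlp = spacy.load("my_model_with_vectors")
    textcat = nlp.create_pipe("textcat")
    textcat.add_label("POSITIVE")  # hypothetical label
    nlp.add_pipe(textcat, last=True)
    # Pass the vectors' name explicitly so the textcat model picks them up
    optimizer = textcat.begin_training(pretrained_vectors=nlp.vocab.vectors.name)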

This bug should be resolved by making sure the components' .begin_training() methods infer the value for pretrained_vectors if it's missing. If the keyword argument is passed explicitly as pretrained_vectors=False or pretrained_vectors=None, we should avoid using the pretrained vectors.
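
Roughly, the intended inference could be sketched like this (illustrative pseudocode, not the actual spaCy source; cfg stands for the component's config dict and vocab for the shared vocab):

    # Illustrative sketch of the proposed inference, not actual spaCy source
    if "pretrained_vectors" not in cfg:
        # Keyword missing: infer the setting from the shared vocab
        if len(vocab.vectors) > 0:
            cfg["pretrained_vectors"] = vocab.vectors.name
    elif cfg["pretrained_vectors"] in (False, None):
        # Explicit False/None opts out of the pretrained vectors
        cfg["pretrained_vectors"] = None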

@honnibal added the feat / textcat and feat / vectors labels Jul 23, 2019
@romlatron
Author

I still can't get it to work: after calling nlp.begin_training(pretrained_vectors=nlp.vocab.vectors.name) and then nlp.update(), the component still doesn't depend on the model's vocabulary when I call it.
I originally use Prodigy for model training, so I'm a bit unclear on how it all works here. Would it be better to call textcat.begin_training() instead?

@romlatron
Author

Hi,
It still doesn't work for me using nlp.begin_training(pretrained_vectors=nlp.vocab.vectors.name). Is there something I might be forgetting?

@BreakBB
Contributor

BreakBB commented Aug 1, 2019

The pretrained_vectors parameter should only be necessary when calling textcat.begin_training, not nlp.begin_training.

Could you add some more code to show how you're loading your model and trying to train it?

@romlatron
Author

Thanks for the answer, here is the training code, adapted from the example on spaCy's website:

    import random

    import spacy
    from spacy.pipeline import TextCategorizer
    from spacy.util import compounding, minibatch

    # `model`, `label`, `n_iter` and `load_data` are defined elsewhere in
    # the script, as in spaCy's text classification example
    nlp = spacy.load(model)
    textcat = TextCategorizer(nlp.vocab)
    textcat.add_label(label)
    nlp.pipeline.append(('textcat', textcat))
    (train_texts, train_cats), (dev_texts, dev_cats) = load_data(label)

    train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training(pretrained_vectors=nlp.vocab.vectors.name)
        batch_sizes = compounding(4.0, 32.0, 1.001)
        for i in range(n_iter):
            losses = {}
            # batch up the examples using spaCy's minibatch
            random.shuffle(train_data)
            batches = minibatch(train_data, size=batch_sizes)
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, losses=losses)

I tried calling nlp.begin_training with no arguments, but that doesn't work either.
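
For reference, a sketch of how honnibal's workaround above might be applied to this script (untested against this exact setup; it would replace the nlp.begin_training() call):

    # Sketch: initialize the textcat component directly so it sees the
    # vectors, and reuse the returned optimizer for nlp.update()
    optimizer = textcat.begin_training(pretrained_vectors=nlp.vocab.vectors.name)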

@romlatron
Author

Hi, could I get an update on this? Maybe just a short code example of how it's supposed to work?

@svlandeg
Member

svlandeg commented Oct 9, 2019

@honnibal: could this be due to the method build_text_classifier depending on the config parameter pretrained_dims, which is set to 0 by default?

[EDIT] (GH is being annoying): should we remove that parameter altogether and only rely on pretrained_vectors?

@svlandeg
Member

@romlatron: apologies for the late follow-up, this will be fixed in the next version, cf. #5004.


lock bot commented Mar 17, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020