Unable to train textcat with en_trf_bertbaseuncased_lg model #4833

Closed
acherednychenko opened this issue Dec 23, 2019 · 3 comments · Fixed by #4834
Labels
feat / textcat (Feature: Text Classifier), training (Training and updating models)

Comments

@acherednychenko

## How to reproduce the behaviour

Use `train_textcat.py` to reproduce. Training works for the core models, but unfortunately not for `en_trf_bertbaseuncased_lg`.

Running:
`python train_textcat.py -m "en_trf_bertbaseuncased_lg" -n 1`
returns:

```
  File "train_textcat.py", line 159, in <module>
    plac.call(main)
  File "/workspace/code/activity-classification/venv_act/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/workspace/code/activity-classification/venv_act/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "train_textcat.py", line 86, in main
    nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
  File "/workspace/code/activity-classification/venv_act/lib/python3.6/site-packages/spacy_transformers/language.py", line 81, in update
    tok2vec = self.get_pipe(PIPES.tok2vec)
  File "/workspace/code/activity-classification/venv_act/lib/python3.6/site-packages/spacy/language.py", line 286, in get_pipe
    raise KeyError(Errors.E001.format(name=name, opts=self.pipe_names))
KeyError: "[E001] No component 'trf_tok2vec' found in pipeline. Available names: ['textcat']"
```

Training on a core model works fine, though:
`python train_textcat.py -m "en_core_web_md" -n 1`


## Your Environment
* **spaCy version:** 2.2.1 (also tested with 2.2.3)
* **Platform:** Linux-4.14.62-70.117.amzn2.x86_64-x86_64-with-debian-stretch-sid
* **Python version:** 3.6.9

Please assist,

PS: Love your products :-)

@svlandeg added the labels training (Training and updating models) and feat / textcat (Feature: Text Classifier) on Dec 24, 2019

@svlandeg
Member

svlandeg commented Dec 24, 2019

I don't think this is supposed to work with the `en_trf_bertbaseuncased_lg` model, only with the "regular" spaCy ones. For the transformer models, slightly different names are used for the pipeline components: `trf_textcat` instead of `textcat`, `trf_tok2vec` instead of `tok2vec`, etc. By using the regular spaCy code here in combination with the transformer model, this gets messed up.

You should be able to run https://github.com/explosion/spacy-transformers/blob/master/examples/train_textcat.py with `en_trf_bertbaseuncased_lg` though!

PS: also see the docs:

> The `trf_textcat` component is based on spaCy's built-in `TextCategorizer` and supports using the features assigned by the transformers models, via the `trf_tok2vec` component. This lets you use a model like BERT to predict contextual token representations, and then learn a text categorizer on top as a task-specific "head".
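
As a minimal sketch of the naming difference described above, the helper below picks the text categorizer name from a list of component names. The function name `textcat_pipe_name` is hypothetical, and the rule it uses (the presence of `trf_tok2vec` marks a transformer pipeline) is an assumption for illustration, not something spaCy or spacy-transformers provides.

```python
def textcat_pipe_name(pipe_names):
    """Return the text categorizer component name that matches a pipeline.

    Assumption: a pipeline containing "trf_tok2vec" is a transformer
    pipeline and pairs with "trf_textcat"; anything else pairs with the
    regular "textcat".
    """
    return "trf_textcat" if "trf_tok2vec" in pipe_names else "textcat"


# The component lists are written out by hand here (assumed values);
# with a loaded model they would come from nlp.pipe_names.
print(textcat_pipe_name(["tagger", "parser", "ner"]))
print(textcat_pipe_name(["sentencizer", "trf_wordpiecer", "trf_tok2vec"]))
```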

@svlandeg
Member

Update: it looks like it's only a small fix to actually get this example script working with the transformer model. The key is to make sure that `trf_wordpiecer` and `trf_tok2vec` are NOT disabled during training. See also PR #4834 as linked above.
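
The idea in the update above can be sketched roughly as follows. This is not the code from the linked PR: the helper name `pipes_to_disable` and the example component list are assumptions for illustration. The only point is that the transformer components stay enabled alongside `textcat`, while everything else is disabled for training.

```python
# Components that must stay enabled while the text categorizer trains.
KEEP_ENABLED = ("textcat", "trf_wordpiecer", "trf_tok2vec")


def pipes_to_disable(pipe_names, keep=KEEP_ENABLED):
    """Return the component names that can be disabled during training."""
    return [name for name in pipe_names if name not in keep]


# Hand-written example list (assumed); with a loaded model use nlp.pipe_names.
example_pipes = ["sentencizer", "trf_wordpiecer", "trf_tok2vec", "textcat"]
print(pipes_to_disable(example_pipes))  # ['sentencizer']

# In a spaCy v2 training script the result would be used roughly like this:
#     with nlp.disable_pipes(*pipes_to_disable(nlp.pipe_names)):
#         nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
```

Disabling every component except `textcat` would leave only `['textcat']` in the pipeline, which is consistent with the `Available names` reported in the `E001` error.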

@lock

lock bot commented Jan 24, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 24, 2020