
Pretrain T2V - Width of CNN layers. #3979

Closed
agombert opened this issue Jul 17, 2019 · 8 comments · Fixed by #5021
Labels: feat / tok2vec (Feature: Token-to-vector layer and pretraining), usage (General spaCy usage)

Comments


agombert commented Jul 17, 2019

Hello,

I tried to pretrain a model with the CNN architecture, but I would like to change the width of the CNN layers to get bigger vectors at the end (128 instead of 96).

I then get a broadcast error, ValueError: could not broadcast input array from shape (128) into shape (96), which seems to come from changing the CNN parameters during pretraining.

How to reproduce the behaviour

I followed these steps:

1st step - W2V init

I trained a W2V model on the same text corpus and wanted to use it as input to learn from. The file w2v_vectors.txt.gz came from gensim modeling.

python -m spacy init-model es /path/to/my/W2V/ --vectors-loc /path/to/my/w2v_vectors.txt.gz
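
For reference, the vectors file was exported from gensim roughly like this (a sketch; the model path and variable names are placeholders, not my exact code):

import gensim

# Sketch: export a trained gensim word2vec model in the text format that
# `spacy init-model --vectors-loc` accepts (gensim compresses to .gz
# automatically based on the file extension).
model = gensim.models.Word2Vec.load("/path/to/my/w2v_model")
model.wv.save_word2vec_format("/path/to/my/w2v_vectors.txt.gz", binary=False)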

2nd step - model from W2V

I followed the training docs to train my new model without any problem:

python -m spacy train es /path/to/my/model_with_w2v/  es_ancora-ud-train.json es_ancora-ud-dev.json --vectors /path/to/my/W2V/

3rd step - pretrain

I pretrained the model, as explained in the docs, with the following command:

python -m spacy pretrain /path/to/my/texts.jsonl /path/to/W2V/model /path/to/my/t2v/ -i 50 -cw 128

4th step - train

After the pretraining finished, I tried to train from the new tok2vec weights:

python -m spacy train es /path/to/my/model_with_t2v/  es_ancora-ud-train.json es_ancora-ud-dev.json -t2v /path/to/my/t2v/model49.bin

And then I get this error:

Traceback (most recent call last):
   File "/home/jovyan/environments/word_emb/lib/python3.6/runpy.py", line 193, in _run_module_as_main "__main__", mod_spec)
   File "/home/jovyan/environments/word_emb/lib/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals)
   File "/home/jovyan/environments/word_emb/lib/python3.6/site-packages/spacy/__main__.py", line 35, in <module> plac.call(commands[command], sys.argv[1:])
   File "/home/jovyan/environments/word_emb/lib/python3.6/site-packages/plac_core.py", line 328, in call cmd, result = parser.consume(arglist)
   File "/home/jovyan/environments/word_emb/lib/python3.6/site-packages/plac_core.py", line 207, in consume return cmd, self.func(*(args + varargs + extraopts), **kwargs)
   File "/home/jovyan/environments/word_emb/lib/python3.6/spacy/cli/train.py", line 219, in train components = _load_pretrained_tok2vec(nlp, init_tok2vec)
   File "/home/jovyan/environments/word_emb/lib/python3.6/spacy/cli/train.py", line 417, in _load_pretrained_tok2vec component.tok2vec.from_bytes(weights_data)
   File "/home/jovyan/environments/word_emb/lib/python3.6/thinc/neural/_classes/model.py", line 372, in from_bytes copy_array(dest, param[b"value"])
   File "/home/jovyan/environments/word_emb/lib/python3.6/thinc/neural/util.py", line 124, in copy_array dst[:] = src
ValueError: could not broadcast input array from shape (128) into shape (96)

Other information about the bug:

When I run it without -cw 128, everything works fine.

Moreover, I can run the training if I set the token_vector_width=128 alias during training. When I do so, training appears to run fine, but I get this error when trying to load the new t2v model:

ValueError                                Traceback (most recent call last)
<ipython-input-23-b9d787f3e370> in <module>
----> 1 nlp = spacy.load('/home/jovyan/words-representation/data/external/20190717_es_iomed_128_alias/model0')
      2 nlp1 = spacy.load('es_core_news_md')

~/environments/word_emb/lib/python3.6/site-packages/spacy/__init__.py in load(name, **overrides)
     25     if depr_path not in (True, False, None):
     26         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 27     return util.load_model(name, **overrides)
     28 
     29 

~/environments/word_emb/lib/python3.6/site-packages/spacy/util.py in load_model(name, **overrides)
    131             return load_model_from_package(name, **overrides)
    132         if Path(name).exists():  # path to model data directory
--> 133             return load_model_from_path(Path(name), **overrides)
    134     elif hasattr(name, "exists"):  # Path or Path-like to model data
    135         return load_model_from_path(name, **overrides)

~/environments/word_emb/lib/python3.6/site-packages/spacy/util.py in load_model_from_path(model_path, meta, **overrides)
    171             component = nlp.create_pipe(name, config=config)
    172             nlp.add_pipe(component, name=name)
--> 173     return nlp.from_disk(model_path)
    174 
    175 

~/environments/word_emb/lib/python3.6/site-packages/spacy/language.py in from_disk(self, path, exclude, disable)
    789             # Convert to list here in case exclude is (default) tuple
    790             exclude = list(exclude) + ["vocab"]
--> 791         util.from_disk(path, deserializers, exclude)
    792         self._path = path
    793         return self

~/environments/word_emb/lib/python3.6/site-packages/spacy/util.py in from_disk(path, readers, exclude)
    628         # Split to support file names like meta.json
    629         if key.split(".")[0] not in exclude:
--> 630             reader(path / key)
    631     return path
    632 

~/environments/word_emb/lib/python3.6/site-packages/spacy/language.py in <lambda>(p, proc)
    785             if not hasattr(proc, "from_disk"):
    786                 continue
--> 787             deserializers[name] = lambda p, proc=proc: proc.from_disk(p, exclude=["vocab"])
    788         if not (path / "vocab").exists() and "vocab" not in exclude:
    789             # Convert to list here in case exclude is (default) tuple

pipes.pyx in spacy.pipeline.pipes.Tagger.from_disk()

~/environments/word_emb/lib/python3.6/site-packages/spacy/util.py in from_disk(path, readers, exclude)
    628         # Split to support file names like meta.json
    629         if key.split(".")[0] not in exclude:
--> 630             reader(path / key)
    631     return path
    632 

pipes.pyx in spacy.pipeline.pipes.Tagger.from_disk.load_model()

pipes.pyx in spacy.pipeline.pipes.Tagger.from_disk.load_model()

~/environments/word_emb/lib/python3.6/site-packages/thinc/neural/_classes/model.py in from_bytes(self, bytes_data)
    370                         name = name.decode("utf8")
    371                     dest = getattr(layer, name)
--> 372                     copy_array(dest, param[b"value"])
    373                 i += 1
    374             if hasattr(layer, "_layers"):

~/environments/word_emb/lib/python3.6/site-packages/thinc/neural/util.py in copy_array(dst, src, casting, where)
    122 def copy_array(dst, src, casting="same_kind", where=None):
    123     if isinstance(dst, numpy.ndarray) and isinstance(src, numpy.ndarray):
--> 124         dst[:] = src
    125     elif is_cupy_array(dst):
    126         src = cupy.array(src, copy=False)

ValueError: could not broadcast input array from shape (128) into shape (96)

Besides, when I disable the tagger while loading the t2v model, the model returns vectors of size 0.
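
For illustration, this is roughly how I check the vector size (a sketch; the model path and example text are placeholders):

import spacy

# Sketch: load the trained model without the tagger and inspect the
# tok2vec output stored on the doc.
nlp = spacy.load("/path/to/my/model_with_t2v/model0", disable=["tagger"])
doc = nlp("Hola mundo")
print(doc.tensor.shape)  # comes back with size 0 instead of (n_tokens, 128)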

Your Environment

Linux-4.9.0-7-amd64-x86_64-with-debian-buster-sid
Python 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
[GCC 7.3.0]
spacy 2.1.4
gensim 3.7.2
thinc 7.0.8

@honnibal (Member)

@agombert This isn't very user-friendly currently, sorry. We should be inferring these settings from the pretrained file, or at least exposing a better error message.

What you need to do at the moment is set the environment variable token_vector_width=128 before you run spacy train. This will change the setting to match your pretrained weights.
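
Something along these lines should work (a sketch; the paths are placeholders, the rest mirrors your train command above):

token_vector_width=128 python -m spacy train es /path/to/my/model_with_t2v/ es_ancora-ud-train.json es_ancora-ud-dev.json -t2v /path/to/my/t2v/model49.bin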

@honnibal added the usage (General spaCy usage) label Jul 17, 2019
@agombert (Author)

Hey @honnibal,

Thank you for the quick answer. Actually, that's what I did with the alias: I set token_vector_width=128, but when I load the model I get the second error message I posted above.

@honnibal (Member)

Hmm. As a work-around, does it work if you also set the environment variable when you load? It should work without it, but it seems a setting might be missing from the config files that get written out.
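
For example, something like this (a sketch; the path is a placeholder):

import os
import spacy

# Sketch: set the variable before loading, so spaCy picks it up when it
# builds the model.
os.environ["token_vector_width"] = "128"
nlp = spacy.load("/path/to/my/t2v/model0")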


agombert commented Jul 17, 2019

I loaded it as:

nlp = spacy.load('/path/to/my/t2v/model0', meta={"lang":"es", "token_vector_width":128})

It loads, but I get vectors of size 0.

EDIT:

When I use the same line after the default pretrain/train with "token_vector_width": 96, I also get vectors of size 0.
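
For reference, this is the check I run after loading (a sketch; the example text is a placeholder):

import spacy

# Sketch: load with the meta override and inspect the tensor width.
nlp = spacy.load("/path/to/my/t2v/model0", meta={"lang": "es", "token_vector_width": 128})
doc = nlp("Hola mundo")
print(doc.tensor.shape)  # I would expect (n_tokens, 128), but the vectors have size 0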

@ines added the feat / tok2vec (Feature: Token-to-vector layer and pretraining) label Jul 17, 2019
@honnibal (Member)

Did you try setting it as an environment variable, instead of passing it in the meta like that?

@agombert (Author)

I have just tried that, but I get the same error at each step.

@agombert (Author)

Hi,

@honnibal, I would like to know if you have found anything about this error.

Moreover, I trained a normal BERT-like model as presented in the documentation. When I load it without any pipeline components (tagger, parser and ner), the vectors are lost and I get vectors of size 0. In fact, I have to load the tagger each time for the model to provide vectors of length 96. Is that normal?

lock bot commented Mar 17, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020