
NER training on the command line: the final model sends everything back as entities #2185

Closed
KMohaghegh opened this issue Apr 4, 2018 · 5 comments
Labels
feat / ner (Feature: Named Entity Recognizer) · training (Training and updating models)


@KMohaghegh

KMohaghegh commented Apr 4, 2018

Hi guys, and thanks for your fantastic work! I have a problem with NER training via the command line, which I'll explain below. I train the NER with new entities on the command line exactly as explained in the spaCy documentation. I am only interested in training the NER. Both the training and the dev input are in JSON format (as they should be).

The command I use is as follows:

python -m spacy train en ../Desktop/Spacy/Model-Train ../Desktop/Spacy/242018/gutenberg_devu.txt.json ../Desktop/242018/gutenbergu.txt.json -n 20 -P -T

The output is:
dropout_from = 0.2 by default
dropout_to = 0.2 by default
dropout_decay = 0.0 by default
batch_from = 1 by default
batch_to = 16 by default
batch_compound = 1.001 by default
max_doc_len = 5000 by default
beam_width = 1 by default
beam_density = 0.0 by default
learn_rate = 0.001 by default
optimizer_B1 = 0.9 by default
optimizer_B2 = 0.999 by default
optimizer_eps = 1e-08 by default
L2_penalty = 1e-06 by default
grad_norm_clip = 1.0 by default
parser_hidden_depth = 1 by default
parser_maxout_pieces = 2 by default
token_vector_width = 128 by default
hidden_width = 200 by default
embed_size = 7000 by default
history_feats = 0 by default
history_width = 0 by default
Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
0 0.000 0.000 0.000 1.913 20.379 3.498 0.000 100.000 17850.4 0.0
1 0.000 0.000 0.000 1.692 6.833 2.712 0.000 100.000 18645.0 0.0
2 0.000 0.000 0.000 2.290 17.970 4.063 0.000 100.000 17854.4 0.0
3 0.000 0.000 0.000 1.708 19.550 3.141 0.000 100.000 18009.6 0.0
4 0.000 0.000 0.000 0.902 11.493 1.672 0.000 100.000 14274.5 0.0
5 0.000 0.000 0.000 1.772 15.679 3.183 0.000 100.000 14633.4 0.0
6 0.000 0.000 0.000 2.617 42.299 4.930 0.000 100.000 15909.5 0.0
7 0.000 0.000 0.000 1.394 2.686 1.835 0.000 100.000 17427.7 0.0
8 0.000 0.000 0.000 1.308 4.502 2.027 0.000 100.000 16058.4 0.0
9 0.000 0.000 0.000 2.867 10.821 4.533 0.000 100.000 12611.6 0.0
10 0.000 0.000 0.000 0.904 4.502 1.506 0.000 100.000 17244.9 0.0
11 0.000 0.000 0.000 1.023 9.400 1.846 0.000 100.000 16253.1 0.0
12 0.000 0.000 0.000 2.470 25.592 4.506 0.000 100.000 18050.8 0.0
13 0.000 0.000 0.000 0.935 3.476 1.473 0.000 100.000 17420.5 0.0
14 0.000 0.000 0.000 1.372 6.714 2.279 0.000 100.000 16866.8 0.0
15 0.000 0.000 0.000 1.786 14.100 3.170 0.000 100.000 15634.5 0.0
16 0.000 0.000 0.000 2.234 13.981 3.852 0.000 100.000 17915.3 0.0
17 0.000 0.000 0.000 0.693 3.081 1.131 0.000 100.000 17383.0 0.0
18 0.000 0.000 0.000 1.540 3.002 2.036 0.000 100.000 18024.5 0.0
19 0.000 0.000 0.000 2.552 27.567 4.672 0.000 100.000 17402.8 0.0
Saving model...

But when I load the final model, the results are terrible: it sends back every token in the text file as an entity.
If someone has experienced something similar, or could take a quick look and spot where my problem is, it would be really appreciated.

Just to add: if I run the code from "Training the named entity recognizer" in the documentation (sketched below) and create a model (training a new entity type or adding to an existing one), everything is fine and the model works totally fine. But, as mentioned several times in other tickets, that approach does not scale to huge datasets and easily crashes when the training data has more than a few hundred thousand tokens. So the command line above does not crash with large training data, but it behaves very strangely!
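
For reference, here is a minimal sketch of that documented update-loop approach against the spaCy 2.x API; the example sentence, offsets, and output path are made up:

import random
import spacy

# One (text, annotations) pair; the offsets mark "Portfolios" as a NAME entity.
TRAIN_DATA = [
    ("Portfolios are rebalanced quarterly.", {"entities": [(0, 10, "NAME")]}),
]

nlp = spacy.blank("en")          # start from a blank English pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("NAME")

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        # spaCy 2.x update() accepts raw text plus an annotations dict
        nlp.update([text], [annotations], sgd=optimizer, drop=0.2)

nlp.to_disk("/tmp/model-loop")   # hypothetical output path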

Part of the CLI-trained model's output looks like this:

NAME crisis with the
NAME positions
NAME . therefore
NAME in portfolios
NAME .
NAME portfolios are
Everything is an entity (even the punctuation)!

Additionally, it would help a lot if someone could explain what these abbreviations in the output mean:
P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
(Looking at the output, the first three columns are all zero. Is that normal?)

I run Python 2.7.10 on a Mac with spaCy 2.0.9.

Thanks in advance for your help.

@KarthikPunyamurthy

KarthikPunyamurthy commented Apr 5, 2018

  1. I have been working on the same thing, but when I load model-final, no entities are fetched, and all the loss columns are 0 except for N.Loss. I would appreciate a solution to this problem for the CLI train command.

  2. Also, setting hyper-parameters from the CLI is difficult. Can we change them directly in the train.py script?

  3. I also want to know whether we can use en_vectors_web_lg with the CLI train command, because when I use it with the -v flag, Python stops and crashes on the first iteration.

  4. Since the training data is in JSON format, how should the dev data be arranged in its JSON file?

The error:

C:\Users\karth>python -m spacy train en "C:\Users\karth\AppData\Local\Programs\Python\Python35\Lib\site-packages\law_md3" "E:\Office files\Python\Output\section.json" "E:\Office files\Python\Output\train.json" -n 50 -P -T -v "C:\Users\karth\AppData\Local\Programs\Python\Python35\Lib\site-packages\en_vectors_web_lg\en_vectors_web_lg-2.0.0"
dropout_from = 0.9 by default
dropout_to = 0.8 by default
dropout_decay = 2e-05 by default
batch_from = 1 by default
batch_to = 16 by default
batch_compound = 1.001 by default
max_doc_len = 5000 by default
beam_width = 1 by default
beam_density = 0.0 by default
learn_rate = 0.001 by default
optimizer_B1 = 0.9 by default
optimizer_B2 = 0.999 by default
optimizer_eps = 1e-08 by default
L2_penalty = 1e-06 by default
grad_norm_clip = 1.0 by default
parser_hidden_depth = 1 by default
parser_maxout_pieces = 2 by default
token_vector_width = 128 by default
hidden_width = 200 by default
embed_size = 7000 by default
history_feats = 0 by default
history_width = 0 by default
Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
0 0.000 2310.695 0.000 0.000 0.000 0.000 0.000 100.000 12436.5 0.0
1 0.000 874.889 0.000 0.000 0.000 0.000 0.000 100.000 4132.4 0.0
2 0.000 600.412 0.000 0.000 0.000 0.000 0.000 100.000 11427.5 0.0
3 0.000 578.595 0.000 0.000 0.000 0.000 0.000 100.000 11715.5 0.0
4 0.000 510.519 0.000 0.000 0.000 0.000 0.000 100.000 6186.7 0.0
5 0.000 535.982 0.000 0.000 0.000 0.000 0.000 100.000 5763.3 0.0
6 0.000 461.030 0.000 0.000 0.000 0.000 0.000 100.000 5499.6 0.0
7 0.000 185.822 0.000 0.000 0.000 0.000 0.000 100.000 8562.9 0.0
8 0.000 101.904 0.000 0.000 0.000 0.000 0.000 100.000 9453.6 0.0
9 0.000 98.318 0.000 0.000 0.000 0.000 0.000 100.000 3685.6 0.0
10 0.000 95.995 0.000 0.000 0.000 0.000 0.000 100.000 6371.8 0.0
11 0.000 96.809 0.000 0.000 0.000 0.000 0.000 100.000 3803.3 0.0
12 0.000 93.430 0.000 0.000 0.000 0.000 0.000 100.000 4032.5 0.0
13 0.000 90.223 0.000 0.000 0.000 0.000 0.000 100.000 11027.8 0.0
14 0.000 88.861 0.000 0.000 0.000 0.000 0.000 100.000 4543.7 0.0
15 0.000 70.051 0.000 0.000 0.000 0.000 0.000 100.000 4216.8 0.0
16 0.000 41.527 0.000 0.000 0.000 0.000 0.000 100.000 7564.7 0.0
17 0.000 41.856 0.000 0.000 0.000 0.000 0.000 100.000 3748.8 0.0
94%|########################################################################3 | 3299/3512 [00:06<00:00, 494.13it/s]

Thanks, any help would mean a lot.

Windows 10
Python 3.5.4
cmd.exe
spaCy v2.0.10

@honnibal
Member

honnibal commented Apr 7, 2018

aij-wikiner-wp3-es-dev-iob.zip

I'm pretty sure something's gone wrong with your pre-processing and your data file is not correct. I've attached an IOB format data file that's likely to be easier to produce. Let's say you extract the file to your /tmp directory. Here are example commands that convert it:

unzip /tmp/aij-wikiner-wp3-es-dev-iob.zip
python -m spacy convert /tmp/aij-wikiner-wp3-es-dev-iob.iob /tmp

This should give you a file /tmp/aij-wikiner-wp3-es-dev-iob.json, which should be correctly formatted for spaCy. You should then have a look at the .iob file and try to get your data to match its format.
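
For reference, a minimal sketch of the layout such a .iob file uses, assuming the usual whitespace-and-pipe IOB convention (one sentence per line, each token written as word|POS|IOB-tag; the sentence below is made up):

I|PRP|O like|VBP|O Berlin|NNP|B-LOC and|CC|O London|NNP|B-LOC .|.|O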

P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %

  • P.Loss: parser loss (irrelevant for you)
  • N.Loss: NER loss (important; it should not be 0)
  • UAS: unlabelled attachment score for the parser (irrelevant for you)
  • NER P.: NER precision on the development data
  • NER R.: NER recall on the development data
  • NER F.: NER F-score on the development data
  • Tag %: tagging accuracy on the development data (irrelevant for you?)
  • Token %: tokenization accuracy on the development data (irrelevant if you use the .iob format, which prevents you from learning from incorrectly tokenized text)
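
For reference, NER F. is the harmonic mean of precision and recall, F = 2·P·R / (P + R). For example, the last row of the first table gives 2 × 2.552 × 27.567 / (2.552 + 27.567) ≈ 4.672, which matches the printed NER F. value.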

@honnibal
Member

honnibal commented Apr 7, 2018

And the use of hyper_parameters from the CLI is difficult, can we directly change it in the script of train.py

The hyper-parameters are currently set via environment variables, because passing the configuration through to all the arbitrary places it has to be read is difficult, and there are constantly new hyper-parameters to expose.

You could try writing to the os.environ dictionary? It's messy but I think it should work.
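
A minimal sketch of that workaround, assuming the variable names match the defaults printed at startup (learn_rate, dropout_from, etc.) and that the training process inherits the modified environment; the paths and values are made up:

import os
import subprocess

# Assumed: spaCy 2.x reads these hyper-parameters from environment
# variables named like the defaults it prints at startup.
os.environ["learn_rate"] = "0.0005"
os.environ["dropout_from"] = "0.5"
os.environ["dropout_to"] = "0.5"

# The spawned training process inherits the modified environment.
subprocess.run(
    ["python", "-m", "spacy", "train", "en", "model-out",
     "train.json", "dev.json", "-n", "20", "-P", "-T"],
    env=os.environ,
)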

@KMohaghegh
Author

Thanks @honnibal, your comment was very helpful and I managed to train the model. Now it works normally and no longer sends everything back as an entity. It was a problem in my JSON file.
Thanks a lot!

@ines added the training (Training and updating models) and feat / ner (Feature: Named Entity Recognizer) labels on Apr 10, 2018
@ines closed this as completed on Apr 10, 2018
@lock

lock bot commented May 10, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock bot locked as resolved and limited conversation to collaborators on May 10, 2018