'list index out of range' error for some batches when using minibatch #2946

Closed
KavyaGujjala opened this issue Nov 19, 2018 · 3 comments
Labels
feat / tokenizer Feature: Tokenizer training Training and updating models usage General spaCy usage

Comments

@KavyaGujjala

How to reproduce the behaviour

Error looks like this:
error is: list index out of range
error text
("The protest comes on the eve of the annual conference of Britain 's ruling Labor Party in the southern English seaside resort of Brighton.", "The International Atomic Energy Agency is to hold second day of talks in Vienna Wednesday on how to respond to Iran 's resumption of low-level uranium conversion.", 'Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country.', "The party is divided over Britain 's participation in the Iraq conflict and the continued deployment of 8,500 British troops in that country.")

error annotations
({'tags': ['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', 'IN', 'NNP', 'POS', 'VBG', 'NNP', 'NNP', 'IN', 'DT', 'JJ', 'JJ', 'NN', 'NN', 'IN', 'NNP', '.']}, {'tags': ['DT', 'NNP', 'NNP', 'NNP', 'NNP', 'VBZ', 'TO', 'VB', 'JJ', 'NN', 'IN', 'NNS', 'IN', 'NNP', 'NNP', 'IN', 'WRB', 'TO', 'VB', 'TO', 'NNP', 'POS', 'NN', 'IN', 'JJ', 'NN', 'NN', '.']}, {'tags': ['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP', 'TO', 'VB', 'DT', 'NN', 'IN', 'NNP', 'CC', 'VB', 'DT', 'NN', 'IN', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.']}, {'tags': ['DT', 'NN', 'VBZ', 'VBN', 'IN', 'NNP', 'POS', 'NN', 'IN', 'DT', 'NNP', 'NN', 'CC', 'DT', 'JJ', 'NN', 'IN', 'CD', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.']})

error is: list index out of range
error text
('They marched from the Houses of Parliament to a rally in Hyde Park.', 'Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as" Bush Number One Terrorist" and" Stop the Bombings."', 'Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000.', 'The London march came ahead of anti-war protests today in other cities, including Rome, Paris, and Madrid.')

error annotations
({'tags': ['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', 'TO', 'DT', 'NN', 'IN', 'NNP', 'NNP', '.']}, {'tags': ['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', 'VBD', 'DT', 'NNS', 'WP', 'VBD', 'NNS', 'IN', 'JJ', 'NNS', 'IN', '', 'NNP', 'NN', 'CD', 'NN', '', 'CC', '', 'VB', 'DT', 'NNS', '.', '']}, {'tags': ['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', 'CD', 'IN', 'NNS', 'VBD', 'PRP', 'VBD', 'CD', '.']}, {'tags': ['DT', 'NNP', 'NN', 'VBD', 'RB', 'IN', 'JJ', 'NNS', 'NN', 'IN', 'JJ', 'NNS', ',', 'VBG', 'NNP', ',', 'NNP', ',', 'CC', 'NNP', '.']})

NO ERROR BATCH
no error text
('Iranian officials say they expect to get access to sealed sensitive parts of the plant Wednesday, after an IAEA surveillance system begins functioning.', 'Iran this week restarted parts of the conversion process at its Isfahan nuclear plant.')

no error annotations
({'tags': ['JJ', 'NNS', 'VBP', 'PRP', 'VBP', 'TO', 'VB', 'NN', 'TO', 'JJ', 'JJ', 'NNS', 'IN', 'DT', 'NN', 'NNP', ',', 'IN', 'DT', 'NNP', 'NN', 'NN', 'VBZ', 'VBG', '.']}, {'tags': ['NNP', 'DT', 'NN', 'VBD', 'NNS', 'IN', 'DT', 'NN', 'NN', 'IN', 'PRP$', 'NNP', 'JJ', 'NN', '.']})

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.6
  • spaCy Version Used: 2.0.10
  • Environment Information:
@KavyaGujjala
Author

Okay, I found out it's because of hyphen-separated words like low-level and anti-war.
spaCy tokenizes each of these as three tokens, whereas I gave only one tag for the whole word.
How can I resolve this?
Can someone help me with this?

@ines ines added usage General spaCy usage training Training and updating models feat / tokenizer Feature: Tokenizer labels Nov 26, 2018
@ines
Member

ines commented Nov 26, 2018

One option would be to change the tokenization by customising the tokenizer. The tokenization rules will be serialized with your model, so your rules will be included when you save out the trained/updated model.

Alternatively, you could also adjust your data and update the tags. Since there's a clear pattern here, you should probably be able to do this programmatically (split text with spaCy, find hyphenated tokens, check your tags at position token.i and add two more tags).
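
A minimal sketch of that second option, under a few assumptions: the token texts are passed in as a plain list (with spaCy you could get them from something like `[t.text for t in nlp.make_doc(text)]`), every `"-"` token comes from a hyphenated word that was split into three tokens, and the split parts simply reuse the tag of the original word. The function name `expand_tags_for_hyphens` and the `hyphen_tag` default of `"HYPH"` are illustrative choices, not something from this thread, so pick whatever tags fit your scheme.

```python
def expand_tags_for_hyphens(tokens, tags, hyphen_tag="HYPH"):
    """Return one tag per token, given one tag per original (unsplit) word.

    Assumes every "-" token comes from a hyphenated word that was split
    into word, "-", word, and that all other tokens line up with `tags`.
    """
    expanded = []
    tag_idx = -1          # position in the original tag list
    after_hyphen = False  # True right after an infix hyphen
    for text in tokens:
        if text == "-" and expanded:
            # The hyphen itself gets the placeholder tag (an assumption).
            expanded.append(hyphen_tag)
            after_hyphen = True
            continue
        if not after_hyphen:
            tag_idx += 1  # a new original word starts here
        if tag_idx >= len(tags):
            raise ValueError("more words than tags")
        # Parts of a split word reuse the tag of the original word.
        expanded.append(tags[tag_idx])
        after_hyphen = False
    if tag_idx != len(tags) - 1:
        raise ValueError("tokens and tags still do not line up")
    return expanded


# Token texts as the tokenizer would split "low-level" (hard-coded here
# so the sketch runs without spaCy), with the tags from the report above.
tokens = ["resumption", "of", "low", "-", "level", "uranium", "conversion", "."]
tags = ["NN", "IN", "JJ", "NN", "NN", "."]
new_tags = expand_tags_for_hyphens(tokens, tags)
```

After this, `len(new_tags)` matches the number of tokens, which is the condition the training loop needs. The `ValueError` is there to flag sentences that are misaligned for some other reason, so they can be inspected by hand rather than silently mis-tagged.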

Finally, when updating the model, you can also pass in Doc and GoldParse objects instead of texts and annotations. The GoldParse can be created with a words keyword argument that specifies the gold-standard tokenization, so you'll be able to train from data that doesn't match how spaCy would normally tokenize a string. However, keep in mind that this can also lead to worse results, since your model's tokenizer will still never produce those tokens. In spaCy v2.1.x, the parser will be able to learn to merge tokens (which is important for languages like Chinese) – this would let you work around this problem, because the parser could learn that "anti-war" is one token, even if the tokenizer previously split it into 3.

@ines ines closed this as completed Nov 26, 2018
@lock

lock bot commented Dec 26, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Dec 26, 2018