'list index out of range' error for some batches when using minibatch #2946

KavyaGujjala · 2018-11-19T07:45:26Z

How to reproduce the behaviour

Error looks like this:
error is: list index out of range
error text
("The protest comes on the eve of the annual conference of Britain 's ruling Labor Party in the southern English seaside resort of Brighton.", "The International Atomic Energy Agency is to hold second day of talks in Vienna Wednesday on how to respond to Iran 's resumption of low-level uranium conversion.", 'Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country.', "The party is divided over Britain 's participation in the Iraq conflict and the continued deployment of 8,500 British troops in that country.")

error annotations
({'tags': ['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', 'IN', 'NNP', 'POS', 'VBG', 'NNP', 'NNP', 'IN', 'DT', 'JJ', 'JJ', 'NN', 'NN', 'IN', 'NNP', '.']}, {'tags': ['DT', 'NNP', 'NNP', 'NNP', 'NNP', 'VBZ', 'TO', 'VB', 'JJ', 'NN', 'IN', 'NNS', 'IN', 'NNP', 'NNP', 'IN', 'WRB', 'TO', 'VB', 'TO', 'NNP', 'POS', 'NN', 'IN', 'JJ', 'NN', 'NN', '.']}, {'tags': ['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP', 'TO', 'VB', 'DT', 'NN', 'IN', 'NNP', 'CC', 'VB', 'DT', 'NN', 'IN', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.']}, {'tags': ['DT', 'NN', 'VBZ', 'VBN', 'IN', 'NNP', 'POS', 'NN', 'IN', 'DT', 'NNP', 'NN', 'CC', 'DT', 'JJ', 'NN', 'IN', 'CD', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.']})

error is: list index out of range
error text
('They marched from the Houses of Parliament to a rally in Hyde Park.', 'Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as" Bush Number One Terrorist" and" Stop the Bombings."', 'Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000.', 'The London march came ahead of anti-war protests today in other cities, including Rome, Paris, and Madrid.')

error annotations
({'tags': ['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', 'TO', 'DT', 'NN', 'IN', 'NNP', 'NNP', '.']}, {'tags': ['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', 'VBD', 'DT', 'NNS', 'WP', 'VBD', 'NNS', 'IN', 'JJ', 'NNS', 'IN', '', 'NNP', 'NN', 'CD', 'NN', '', 'CC', '', 'VB', 'DT', 'NNS', '.', '']}, {'tags': ['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', 'CD', 'IN', 'NNS', 'VBD', 'PRP', 'VBD', 'CD', '.']}, {'tags': ['DT', 'NNP', 'NN', 'VBD', 'RB', 'IN', 'JJ', 'NNS', 'NN', 'IN', 'JJ', 'NNS', ',', 'VBG', 'NNP', ',', 'NNP', ',', 'CC', 'NNP', '.']})

NO ERROR BATCH
no error text
('Iranian officials say they expect to get access to sealed sensitive parts of the plant Wednesday, after an IAEA surveillance system begins functioning.', 'Iran this week restarted parts of the conversion process at its Isfahan nuclear plant.')

no error annotations
({'tags': ['JJ', 'NNS', 'VBP', 'PRP', 'VBP', 'TO', 'VB', 'NN', 'TO', 'JJ', 'JJ', 'NNS', 'IN', 'DT', 'NN', 'NNP', ',', 'IN', 'DT', 'NNP', 'NN', 'NN', 'VBZ', 'VBG', '.']}, {'tags': ['NNP', 'DT', 'NN', 'VBD', 'NNS', 'IN', 'DT', 'NN', 'NN', 'IN', 'PRP$', 'NNP', 'JJ', 'NN', '.']})

Your Environment

Operating System: Windows 10
Python Version Used: 3.6
spaCy Version Used: 2.0.10
Environment Information:


KavyaGujjala · 2018-11-19T10:03:14Z

Okay, I found out it's because of hyphen-separated words like low-level and anti-war.
spaCy tokenizes each of these as three tokens, whereas I gave only one tag for the whole word.
How can I resolve this?
Can someone help me with this?
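
A quick way to confirm which examples in a batch are affected is to compare the token count with the tag count before training. The sketch below is only illustrative: `find_mismatches` and `standin_tokenize` are hypothetical names, and the regex tokenizer is a stand-in that is not spaCy's actual rule set (it merely also splits on hyphens). With spaCy loaded, one would presumably pass something like `lambda text: [t.text for t in nlp.make_doc(text)]` instead.

```python
import re


def find_mismatches(texts, annotations, tokenize):
    """Return (index, n_tokens, n_tags) for each example whose counts differ."""
    mismatches = []
    for i, (text, annot) in enumerate(zip(texts, annotations)):
        n_tokens = len(tokenize(text))
        n_tags = len(annot["tags"])
        if n_tokens != n_tags:
            mismatches.append((i, n_tokens, n_tags))
    return mismatches


def standin_tokenize(text):
    # ASSUMPTION: a crude stand-in, not spaCy's tokenizer. It splits off every
    # punctuation character, so "anti-war" becomes "anti", "-", "war".
    return re.findall(r"\w+|[^\w\s]", text)


# One failing and one passing example taken from the batches above.
texts = (
    "The London march came ahead of anti-war protests today in other cities, "
    "including Rome, Paris, and Madrid.",
    "Iran this week restarted parts of the conversion process at its Isfahan "
    "nuclear plant.",
)
annotations = (
    {"tags": ["DT", "NNP", "NN", "VBD", "RB", "IN", "JJ", "NNS", "NN", "IN",
              "JJ", "NNS", ",", "VBG", "NNP", ",", "NNP", ",", "CC", "NNP", "."]},
    {"tags": ["NNP", "DT", "NN", "VBD", "NNS", "IN", "DT", "NN", "NN", "IN",
              "PRP$", "NNP", "JJ", "NN", "."]},
)

print(find_mismatches(texts, annotations, standin_tokenize))
# -> [(0, 23, 21)]
```

The first example has two more tokens than tags, which matches a single hyphenated word being split into three tokens.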

ines · 2018-11-26T12:44:48Z

One option would be to change the tokenization by customising the tokenizer. The tokenization rules will be serialized with your model, so your rules will be included when you save out the trained/updated model.

Alternatively, you could also adjust your data and update the tags. Since there's a clear pattern here, you should probably be able to do this programmatically (split text with spaCy, find hyphenated tokens, check your tags at position token.i and add two more tags).
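
The programmatic adjustment described above could look roughly like the following sketch. It is not an official spaCy helper: `expand_hyphen_tags` is a hypothetical name, and two choices in it are assumptions to revisit for your data. The hyphen token gets the tag `HYPH`, and both halves of the word reuse the original word-level tag. The input is a list of `(text, has_trailing_space)` pairs, which with spaCy could presumably be built as `[(t.text, bool(t.whitespace_)) for t in nlp.make_doc(text)]`.

```python
def expand_hyphen_tags(tokens, tags, hyphen_tag="HYPH"):
    """Expand one-tag-per-word annotations to match hyphen-split tokens.

    tokens: list of (text, has_trailing_space) pairs from the tokenizer.
    tags:   original tags, with one tag per hyphenated word.
    """
    expanded = []
    tag_idx = 0
    i = 0
    while i < len(tokens):
        if tag_idx >= len(tags):
            raise ValueError("more tokens than tags, even after expansion")
        tag = tags[tag_idx]
        expanded.append(tag)
        # A token glued to a following "-" that is itself glued to the next
        # token is treated as part of one hyphenated word.
        while (
            i + 2 < len(tokens)
            and not tokens[i][1]
            and tokens[i + 1][0] == "-"
            and not tokens[i + 1][1]
        ):
            # ASSUMPTION: tag the hyphen as HYPH and repeat the word's tag.
            expanded.extend([hyphen_tag, tag])
            i += 2
        i += 1
        tag_idx += 1
    if tag_idx != len(tags):
        raise ValueError("tags left over; the mismatch is not only hyphens")
    return expanded


# "anti-war protests today", hand-written in the shape spaCy would produce.
tokens = [("anti", False), ("-", False), ("war", True),
          ("protests", True), ("today", False)]
tags = ["JJ", "NNS", "NN"]

print(expand_hyphen_tags(tokens, tags))
# -> ['JJ', 'HYPH', 'JJ', 'NNS', 'NN']
```

The function raises a `ValueError` when the counts still disagree after expansion, so examples with other tokenization differences are surfaced rather than silently misaligned.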

Finally, when updating the model, you can also pass in Doc and GoldParse objects instead of texts and annotations. The GoldParse can be created with a words keyword argument that specifies the gold-standard tokenization, so you'll be able to train from data that doesn't match how spaCy would normally tokenize a string. However, keep in mind that this can also lead to worse results, since your model's tokenizer will still never produce those tokens. In spaCy v2.1.x, the parser will be able to learn to merge tokens (which is important for languages like Chinese) – this would let you work around this problem, because the parser could learn that "anti-war" is one token, even if the tokenizer previously split it into 3.

lock · 2018-12-26T13:04:46Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the usage (General spaCy usage), training (Training and updating models), and feat / tokenizer (Feature: Tokenizer) labels Nov 26, 2018

ines closed this as completed Nov 26, 2018

lock bot locked as resolved and limited conversation to collaborators Dec 26, 2018
