Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

💫 Raise error if annotation dict in simple training style has unexpected keys #4074

Closed
ines opened this issue Aug 3, 2019 · 4 comments
Closed
Labels
enhancement Feature requests and improvements help wanted (easy) Contributions welcome! (also suited for spaCy beginners) help wanted Contributions welcome! training Training and updating models

Comments

@ines
Copy link
Member

ines commented Aug 3, 2019

This is a really easy mistake to make and can lead to very frustrating debugging experiences. It just happened to me again and I've debugged people's training code in the past where this issue was the root cause of the model not learning anything.

Consider the following example:

texts = ["hello world", "this is a text"]
cats = [{"LABEL": True}, {"LABEL": False}]
nlp.update(texts, cats)

Spot the problem? It's here:

texts = ["hello world", "this is a text"]
- cats = [{"LABEL": True}, {"LABEL": False}]
+ cats = [{"cats": {"LABEL": True}}, {"cats": {"LABEL": False}}]
nlp.update(texts, cats)

Requiring a dict like that makes sense because it allows you to train multiple things at once (entities and text categories for instance). However, we currently do not raise an error if no expected top-level keys are present and instead just quietly ignore the additional keys. Which means that in the first example, the model would have been updated with nothing.

Instead, spaCy should raise an error like "Trying to call nlp.update with annotation type 'LABEL'. Expected top-level keys 'words', 'tags', 'heads', 'deps', 'entities' or 'cats'. Got: {"LABEL": True}."

@ines ines added enhancement Feature requests and improvements help wanted Contributions welcome! help wanted (easy) Contributions welcome! (also suited for spaCy beginners) training Training and updating models labels Aug 3, 2019
@jenojp
Copy link
Contributor

jenojp commented Aug 3, 2019

Hi @ines , I took a stab at this one. I put in a check that checks to see if any of the keys are in the list of expected top-level keys and raises an error if not. So this ensures:

annots = [{"LABEL": True}, {"LABEL": False}] does not work
annots = [{"cats": {"LABEL": True}}, {"cats": {"LABEL": False}}] works
annots = [{"cats": {"LABEL": True}, "other":{"POSITIVE": 1.0}}, {"cats": {"LABEL": False}, "other":{"POSITIVE": 1.0}}] also works though

Is this the behavior you were expecting? I can go ahead submit a pull request if so.

jenojp added a commit to jenojp/spaCy that referenced this issue Aug 3, 2019
@ines
Copy link
Member Author

ines commented Aug 4, 2019

@jenojp Nice, thanks a lot for taking this on so quickly! 👍 The fix looks good, so yes, please go ahead and submit the PR.

annots = [{"cats": {"LABEL": True}, "other":{"POSITIVE": 1.0}}, {"cats": {"LABEL": False}, "other":{"POSITIVE": 1.0}}] also works though

I think we could maybe even be stricter here and check for all keys that are not in our list of expected keys. And then raise an error if any unexpected values are found. For example, something like this:

expected_keys = ("words", "tags", "heads", "deps", "entities", "cats")
unexpected_keys = [k for k in gold if k not in expected_keys]
if unexpected_keys:
    raise ValueError  # etc.

In the error, we could then only print the unexpected_keys, which should give the user enough hints without clogging up their terminal. (Because in theory, you might be training with tons of categories, entities or really long texts with per-token tags, which could produce really long errors.)

@jenojp
Copy link
Contributor

jenojp commented Aug 4, 2019

Awesome, thanks for clarifying! Just submitted the PR.

ines pushed a commit that referenced this issue Aug 6, 2019
…d keys #4074 (#4079)

* adding enhancement #4074.

* modified behavior to strictly require top level dictionary keys - issue #4074

* pass expected keys to error message and add links as expected top level key
@ines ines closed this as completed Aug 6, 2019
polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019
…d keys explosion#4074 (explosion#4079)

* adding enhancement explosion#4074.

* modified behavior to strictly require top level dictionary keys - issue explosion#4074

* pass expected keys to error message and add links as expected top level key
@lock
Copy link

lock bot commented Sep 5, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 5, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement Feature requests and improvements help wanted (easy) Contributions welcome! (also suited for spaCy beginners) help wanted Contributions welcome! training Training and updating models
Projects
None yet
Development

No branches or pull requests

2 participants