Unclear documentation about the JSON input format for training #2353

Closed
JamesMessinger opened this issue May 22, 2018 · 3 comments

Labels
docs (Documentation and website), training (Training and updating models)

Comments

@JamesMessinger
Contributor

Which page or section is this issue related to?

JSON input format for training

Issue

The comment next to the paragraphs.sentences.tokens.id attribute says "index of the token in the document", but in the sample JSON on the same page, the id attribute seems to be the index of the token in the sentence. Notice that the id attribute resets to zero for tokens in the second sentence.

So which is correct? The sample JSON, or the documentation?
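
To make the two readings concrete, here is a minimal sketch that numbers the tokens of a small paragraph both ways. The paragraph, its sentences, and the `orth` field are made up for illustration (this is not the sample from the docs page), and `number_tokens` is a hypothetical helper, not part of spaCy.

```python
import json

# Hypothetical two-sentence paragraph. The stored "id" resets to 0 in the
# second sentence, mimicking the behaviour described for the docs sample.
SAMPLE = json.loads("""
{
  "sentences": [
    {"tokens": [{"id": 0, "orth": "I"}, {"id": 1, "orth": "ran"}, {"id": 2, "orth": "."}]},
    {"tokens": [{"id": 0, "orth": "She"}, {"id": 1, "orth": "walked"}, {"id": 2, "orth": "."}]}
  ]
}
""")


def number_tokens(paragraph):
    """Return (orth, stored_id, index_in_sentence, index_in_document) per token."""
    rows = []
    doc_index = 0  # keeps counting across sentence boundaries
    for sentence in paragraph["sentences"]:
        # enumerate() restarts at 0 for every sentence
        for sent_index, token in enumerate(sentence["tokens"]):
            rows.append((token["orth"], token["id"], sent_index, doc_index))
            doc_index += 1
    return rows


rows = number_tokens(SAMPLE)
for row in rows:
    print(row)
```

In this made-up data the stored `id` matches the sentence-level index for every token, while the document-level index keeps increasing into the second sentence.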

If the sample is correct, then what is the best way to get the index of a token within its sentence? It's easy to get the index within the document:

sents = list(doc.sents)
# A sentence is a Span, so it can be indexed directly
first_token_of_second_sentence = sents[1][0]
print(first_token_of_second_sentence.i)
# prints 49

but the only way I can find to determine the index of a token within its sentence is to subtract the index of the last token of the previous sentence, then subtract 1, which seems inelegant. Is there a better way?

sents = list(doc.sents)
last_token_of_first_sentence = sents[0][-1]
first_token_of_second_sentence = sents[1][0]
print(first_token_of_second_sentence.i - last_token_of_first_sentence.i - 1)
# prints 0
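
One possible answer, offered as a sketch and not as a recommendation confirmed by the maintainers: a sentence in spaCy is a `Span`, and `Span.start` holds the document-level index of its first token, so subtracting it from `token.i` gives the position within the sentence. The helper name `index_in_sentence` is hypothetical. The example uses stand-in objects instead of a real `Doc` so that it runs without a model, and it assumes a spaCy version in which `Token.sent` is available.

```python
from types import SimpleNamespace


def index_in_sentence(token):
    """Return the 0-based position of a token within its own sentence.

    Works with any object exposing `.i` (document-level index) and
    `.sent.start` (document-level index of the sentence's first token).
    """
    return token.i - token.sent.start


# Stand-ins for a sentence Span that starts at document index 49 and two
# of its tokens; with a real Doc these would come from doc.sents.
second_sentence = SimpleNamespace(start=49)
first_token = SimpleNamespace(i=49, sent=second_sentence)
third_token = SimpleNamespace(i=51, sent=second_sentence)

print(index_in_sentence(first_token))
print(index_in_sentence(third_token))
```

With the `sents` list from the snippets above, the same idea can be written as `token.i - sents[1].start`, which avoids looking at the previous sentence at all.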

@JamesMessinger
Contributor Author

I dug through the source code a bit, and it looks like the GoldCorpus class is where the JSON training data gets read. In that code, the token ID appears to be relative to the sentence, not the document. But it doesn't really seem to matter, because the id attribute in the JSON file is never actually used. Instead, the token ID is calculated on the fly.

Then again... it's also possible that I'm completely misunderstanding what the code is doing. So I'd love a sanity check and/or confirmation from someone else.

@ines added the docs (Documentation and website) and training (Training and updating models) labels on May 25, 2018
@ines
Member

ines commented Dec 8, 2018

Merging this with #2928 – thanks for your patience on this!

@ines closed this as completed on Dec 8, 2018
@lock

lock bot commented Jan 7, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock (bot) locked this as resolved and limited conversation to collaborators on Jan 7, 2019