Unclear documentation about the JSON input format for training #2353
Comments
I dug through the source code a bit, and it looks like the GoldCorpus class is where the JSON training data gets read. And in that function, it appears that the token ID is, in fact, relative to the sentence, not the document. But it doesn't really seem to matter, because the
Then again... it's also possible that I'm completely misunderstanding what the code is doing. So I'd love a sanity check and/or confirmation from someone else.
Merging this with #2928 – thanks for your patience on this!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Which page or section is this issue related to?
JSON input format for training
Issue
The comment next to the paragraphs.sentences.tokens.id attribute says "index of the token in the document", but in the sample JSON on the same page, the id attribute seems to be the index of the token in the sentence. Notice that the id attribute resets to zero for tokens in the second sentence.
So which is correct? The sample JSON, or the documentation?
If the sample is correct, then what is the best way to get the index of a token within its sentence? It's easy to get the index within the document:
but the only way I can find to determine the index of a token within its sentence is to subtract the index of the last token of the previous sentence, then subtract 1, which seems inelegant. Is there a better way?
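For comparison, here is a minimal sketch of how both indices could be derived directly from the JSON structure, without any subtraction. It assumes the nesting described on the docs page (a paragraph holding sentences, each holding tokens with id and orth fields). The function name token_indices and the sample data are hypothetical, and the running index here counts across the sentences of one paragraph only. On an already parsed Doc with sentence boundaries set, token.i - token.sent.start should give the same sentence-relative index.

```python
from typing import Dict, Iterator, Tuple


def token_indices(paragraph: Dict) -> Iterator[Tuple[int, int, str]]:
    """Yield (running_index, sentence_index, orth) for each token.

    running_index counts tokens across all sentences of the paragraph;
    sentence_index restarts at zero at the start of every sentence.
    """
    running_index = 0
    for sentence in paragraph["sentences"]:
        for sentence_index, token in enumerate(sentence["tokens"]):
            yield running_index, sentence_index, token["orth"]
            running_index += 1


# Hypothetical sample, shaped like the training JSON, in which the
# "id" field restarts at zero in the second sentence.
paragraph = {
    "sentences": [
        {"tokens": [
            {"id": 0, "orth": "Hello"},
            {"id": 1, "orth": "world"},
            {"id": 2, "orth": "."},
        ]},
        {"tokens": [
            {"id": 0, "orth": "Bye"},
            {"id": 1, "orth": "."},
        ]},
    ]
}

rows = list(token_indices(paragraph))
for running_index, sentence_index, orth in rows:
    print(running_index, sentence_index, orth)
```

Because both counters are computed from position rather than read from the id field, the sketch gives the same result whichever interpretation of id the data uses.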