Unclear documentation about the JSON input format for training #2353

JamesMessinger · 2018-05-22T18:00:42Z

Which page or section is this issue related to?

Issue

The comment next to the paragraphs.sentences.tokens.id attribute says "index of the token in the document", but in the sample JSON on the same page, the id attribute seems to be the index of the token in the sentence. Notice that the id attribute resets to zero for tokens in the second sentence.

So which is correct? The sample JSON, or the documentation?

If the sample is correct, then what is the best way to get the index of a token within its sentence? It's easy to get the index within the document:

sents = list(doc.sents)
first_token_of_second_sentence = sents[1][0]
print(first_token_of_second_sentence.i)  # Token.i is the index within the document
# Output: 49

but the only way I can find to determine the index of a token within its sentence is to subtract the index of the last token of the previous sentence, then subtract 1, which seems inelegant. Is there a better way?

sents = list(doc.sents)
last_token_of_first_sentence = sents[0][-1]
first_token_of_second_sentence = sents[1][0]
# Offset from the last token of the previous sentence, minus 1
print(first_token_of_second_sentence.i - last_token_of_first_sentence.i - 1)
# Output: 0
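
One possible alternative, offered as a sketch rather than a documented recommendation: each sentence yielded by doc.sents is a Span, and its start attribute is the document index of its first token, so the index within the sentence can be computed as token.i - sent.start without looking at the previous sentence. The helper name index_in_sentence and the stand-in numbers below are made up for illustration so the snippet runs without a loaded model; the value 49 is taken from the output above.

```python
def index_in_sentence(token_i: int, sent_start: int) -> int:
    """Offset of a token from the first token of its sentence."""
    return token_i - sent_start


# With spaCy objects this would be called as, for example:
#   for sent in doc.sents:
#       for token in sent:
#           index_in_sentence(token.i, sent.start)

# Stand-in values (assumed): the second sentence starts at document index 49.
second_sentence_start = 49
print(index_in_sentence(49, second_sentence_start))  # first token of the sentence
print(index_in_sentence(51, second_sentence_start))  # third token of the sentence
```

Iterating with enumerate(sent) gives the same sentence-relative position if the document index is not needed.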

JamesMessinger · 2018-05-22T18:36:37Z

I dug through the source code a bit, and it looks like the GoldCorpus class is where the JSON training data gets read. In that code, the token ID appears to be relative to the sentence, not the document. But it doesn't really seem to matter, because the id attribute in the JSON file is never actually read. Instead, the token ID is calculated on the fly from the token's position.

Then again... it's also possible that I'm completely misunderstanding what the code is doing. So I'd love a sanity check and/or confirmation from someone else.
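
To make that reading concrete, here is a minimal sketch of what "calculated on the fly" could look like. It is not spaCy's actual reader code. The function name recompute_token_indices, the sample data, and the use of an orth key for the token text are assumptions for illustration; only the paragraphs.sentences.tokens.id nesting comes from the docs discussed above.

```python
import json

# Made-up sample: token ids restart at 0 in each sentence, as in the docs' example.
SAMPLE_PARAGRAPH = json.loads("""
{
  "sentences": [
    {"tokens": [{"id": 0, "orth": "Hello"}, {"id": 1, "orth": "world"}, {"id": 2, "orth": "."}]},
    {"tokens": [{"id": 0, "orth": "Bye"}, {"id": 1, "orth": "."}]}
  ]
}
""")


def recompute_token_indices(paragraph):
    """Return (orth, stored_id, position_in_sentence, position_in_paragraph) per token.

    The stored "id" is only reported, never used: both positions come from
    the order in which the tokens appear.
    """
    rows = []
    position_in_paragraph = 0
    for sentence in paragraph["sentences"]:
        for position_in_sentence, token in enumerate(sentence["tokens"]):
            rows.append(
                (token["orth"], token.get("id"), position_in_sentence, position_in_paragraph)
            )
            position_in_paragraph += 1
    return rows


for row in recompute_token_indices(SAMPLE_PARAGRAPH):
    print(row)
```

In this sketch the stored id matches the sentence-relative position for every token, while the paragraph-level position keeps counting across the sentence boundary.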

ines · 2018-12-08T11:34:17Z

Merging this with #2928 – thanks for your patience on this!

lock · 2019-01-07T12:28:41Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the docs (Documentation and website) and training (Training and updating models) labels on May 25, 2018

ines closed this as completed Dec 8, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 7, 2019
