Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add proposal for reading jsonlines #4458

Closed
wants to merge 1 commit into from

Conversation

akornilo
Copy link
Contributor

@akornilo akornilo commented Oct 16, 2019

Allow for jsonlines input for training/creating a gold corpus

Description

The input formats for training in the CLI are confusing (see the later comments in my issue here - #4452)

When jsonlines is used as the file extension (*.jsonl) - the format that the code expects is not actually the JSON input format (but the tuples from the tuples_from_json function). This PR covers 2 things:

  • Clarifying the documentation on the webpage
  • My changes to the gold.pyx file allow jsonlines of the input format to be ingested. Thus the input files can either be [doc1, doc2] for doc1 \n doc2.
    The modified test checks for both formats.

Types of change

enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd
Copy link
Contributor

This is definitely an area that needs improvement, but I don't think it's a good idea to have .json files that aren't actually valid JSON. This will only lead to even more confusion in the future.

I think a better alternative would be to extend the JSONL handling to support both the training JSON format and the tuple JSON format.

The train CLI also supports multiple JSON files as input (it recursively searches for all JSON files in a provided directory), so this is another option when dealing with larger training corpora, since the training JSON files do get pretty unwieldy.

@akornilo
Copy link
Contributor Author

That is a reasonable concern. My concern comes from the fact that the current _json_Iterate function requires the example to be wrapped in a list. So even if I have one example per file, I will need to do [{example}] instead of just {example}. Maybe that isnt a big deal though.

Maybe handling both forms of "jsonl" is the right answer - the current behavior is confusing..so either fix can work. I will throw up the alternative format.

@adrianeboyd
Copy link
Contributor

At this point, I think you can detect whether the JSONL line is a dict or a tuple and process accordingly. This will get trickier if/when a new training format is introduced, but I think a very simple check would be okay for right now.

What are your concerns/needs that require JSONL instead of JSON here? We're definitely interested in feedback in this area and understanding what people's needs are.

The internal training corpora used for spacy are large enough to be a bit unwieldy in git, so I'm currently working on splitting them up into individual files for each document. Each document will be a dict in a list like [{example}], which maybe isn't as elegant as it could be, but it doesn't seem like a major problem either.

@akornilo
Copy link
Contributor Author

I think jsonlines files are easier to work with. But I think that the current formulation is fine...it just took me a while to figure out which types of inputs work and don't work. On a semi-related note, I think the json2jsonl converter has some issues, but I need to spend some more time debugging that.

@adrianeboyd
Copy link
Contributor

Looking at the docs again, this is even more confusing than I remembered. I've mostly been working with JSON not JSONL, so I hadn't looked at the details here. I think it would make sense to close this and start a new issue to get the JSONL bits straightened out, which would solve a few of your problems. (The spacy convert default is JSON not JSONL unlike what the docs say, the train CLI can't read in the JSONL format that spacy convert produces, etc.)

@svlandeg svlandeg added docs Documentation and website feat / cli Feature: Command-line interface labels Oct 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation and website feat / cli Feature: Command-line interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants