💫 New JSON helpers, training data internals & CLI rewrite #2932

Merged: 48 commits into develop from feature/docs-json-training, Nov 30, 2018

@ines ines commented Nov 15, 2018

Related issue: #2928

Description

  • Add new Doc.to_json() method to standardise JSON serialization. This will be the one method used to convert Doc objects to JSON data in spaCy's format.
  • Implement JSON schemas and methods to validate incoming data (training data, model meta etc).
  • Add debug-data command
  • Refactor CLI to use wasabi
  • Use black for auto-formatting
  • Add flake8 config
  • Move all messy UD-related scripts to cli.ud
  • Make converters functions that take the opened file and return the converted data (instead of having them handle the I/O)
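
The converter change in the last item can be illustrated with a minimal sketch. This is not the code from this pull request: `convert_conll_lines`, `run_convert`, the tab-separated input, and the output keys are all hypothetical names and formats chosen for illustration. The point is only the division of labour, where the caller owns the I/O and the converter is a plain function from file contents to data.

```python
import io
import json


def convert_conll_lines(input_data):
    """Hypothetical converter: takes the *contents* of an opened file
    (a string) and returns converted data. It performs no file I/O itself."""
    sentences = []
    tokens = []
    for line in input_data.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if tokens:
                sentences.append({"tokens": tokens})
                tokens = []
            continue
        word, tag = line.split("\t")
        tokens.append({"id": len(tokens), "orth": word, "tag": tag})
    if tokens:
        sentences.append({"tokens": tokens})
    return sentences


def run_convert(file_obj, converter):
    """Hypothetical caller (e.g. a CLI command): it reads the opened file,
    hands the contents to the converter, and returns the result."""
    return converter(file_obj.read())


# Because the converter never touches the filesystem, an in-memory
# file object works just as well as a real one.
sample = io.StringIO("I\tPRP\nlike\tVBP\ncats\tNNS\n\nThey\tPRP\npurr\tVBP\n")
converted = run_convert(sample, convert_conll_lines)
print(json.dumps(converted, indent=2))
```

Keeping the I/O out of the converter makes it straightforward to test with in-memory data, as the `io.StringIO` call above does.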

Types of change

enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Will be replaced with Doc.to_json, which will produce a unified format
Converts Doc objects to JSON using the same unified format as the training data. The method also supports serializing selected custom attributes in the doc._ space.
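
As a rough illustration of "serializing selected custom attributes", here is a toy sketch that does not use spaCy at all. `ToyDoc`, its `custom` dict (standing in for the doc._ space), the `underscore` argument, and the output keys are assumptions made for this example; the real `Doc.to_json` signature and output may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: whitespace-tokenized text plus a
    dict of custom attributes (playing the role of the doc._ space)."""

    text: str
    custom: dict = field(default_factory=dict)

    def to_json(self, underscore=None):
        """Return JSON-serializable data. Only the custom attributes
        named in `underscore` are included, under the "_" key."""
        tokens = []
        offset = 0
        for i, word in enumerate(self.text.split()):
            # Character offsets of each token in the original text.
            start = self.text.index(word, offset)
            end = start + len(word)
            tokens.append({"id": i, "start": start, "end": end})
            offset = end
        data = {"text": self.text, "tokens": tokens}
        if underscore:
            data["_"] = {}
            for name in underscore:
                if name not in self.custom:
                    raise ValueError(f"Unknown custom attribute: {name}")
                value = self.custom[name]
                json.dumps(value)  # raises TypeError if not JSON-serializable
                data["_"][name] = value
        return data


doc = ToyDoc("Hello world", custom={"source": "example", "internal_id": 42})
json_doc = doc.to_json(underscore=["source"])
print(json.dumps(json_doc))
```

Only `source` is exported here; `internal_id` stays out of the JSON because it was not selected.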
@ines ines added the enhancement, 🌙 nightly, ⚠️ wip, training, and feat / doc labels Nov 15, 2018
@ines ines mentioned this pull request Nov 18, 2018
8 tasks
To be merged into #2932.

## Description
- [x] refactor CLI to use [`wasabi`](https:/ines/wasabi)
- [x] use [`black`](https:/ambv/black) for auto-formatting
- [x] add `flake8` config
- [x] move all messy UD-related scripts to `cli.ud`
- [x] make converters functions that take the opened file and return the converted data (instead of having them handle the I/O)

### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
@ines ines changed the title from "💫 Improve JSON format and training data internals" to "💫 Improve JSON format, training data internals & CLI" Nov 26, 2018
@ines ines mentioned this pull request Nov 29, 2018
8 tasks
ines added a commit that referenced this pull request Nov 30, 2018

## Description
- [x] Use [`black`](https:/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)
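
For readers unfamiliar with the markers mentioned in the last item, here is a small sketch of how they are typically used. The names `TAG_MAP` and `lookup_pos` are hypothetical and not taken from spaCy's code base.

```python
# A bare "# noqa" asks flake8 to ignore every warning on that line;
# "# noqa: F401" limits this to one code (here: "imported but unused",
# which is common when a module only re-exports a name).
from os import path  # noqa: F401

# fmt: off
# Hand-aligned table: black would otherwise reflow this literal, so the
# formatter is switched off for this region only.
TAG_MAP = {
    "NN":  {"pos": "NOUN"},
    "VB":  {"pos": "VERB"},
    "JJ":  {"pos": "ADJ"},
}
# fmt: on


def lookup_pos(tag):
    """Return the coarse part-of-speech for a tag, or None if unknown."""
    entry = TAG_MAP.get(tag)
    return entry["pos"] if entry else None


print(lookup_pos("VB"))
```

Everything between `# fmt: off` and `# fmt: on` is left untouched by `black`, which suits large hand-formatted data tables.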

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` and actually get meaningful results.

At the moment, the code style and linting aren't applied automatically, but I'm hoping that the new [GitHub Actions](https:/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
@ines ines requested a review from honnibal November 30, 2018 16:54
@ines ines changed the title from "💫 Improve JSON format, training data internals & CLI" to "💫 New JSON helpers, training data internals & CLI rewrite" Nov 30, 2018
@ines ines removed the ⚠️ wip Work in progress label Nov 30, 2018
@honnibal honnibal merged commit 37c7c85 into develop Nov 30, 2018
@ines ines deleted the feature/docs-json-training branch November 30, 2018 19:16
Labels: enhancement, feat / doc, 🌙 nightly, training