Reduce size of language data #4140

Closed
wants to merge 153 commits into from

Conversation

polm
Contributor

@polm polm commented Aug 18, 2019

Description

This PR reduces the size of a spaCy installation from ~280MB to ~74MB by packaging language data as gzipped JSON rather than Python source files, addressing the concerns collected in #3258. Following this change, language data accounts for slightly less than half of the total size of a spaCy installation. Note that this depends on explosion/srsly#9, and requirements.txt will need to be updated after that is merged and released.

Lemmatizer classes have basically two structures: either a single lemmatizer.py, or a directory where files divided by part of speech and other considerations are unified in an __init__.py. This change deals with both kinds of lemmatizers, but only converts files that were individually over ~500kB. Some of these files were dictionaries and some were lists. For languages that use a directory for lemmatizer data, it might be a good idea to put all language data in a single JSON file in the future.

Note that because the data has only been moved from Python to JSON, the spaCy source itself is not actually smaller. As part of a dist build, the JSON files are gzipped and only the gzipped files are included in the distribution. This also means the actual wheel files or .tar.gz packages haven't changed in size (since they're already zipped).

Gzipped msgpack was also considered but resulted in larger files than JSON.
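
As a rough illustration of the loading side (not the PR's actual implementation, which relies on the gzipped-JSON support added in explosion/srsly#9), a loader with this kind of fallback could look like the sketch below; the helper name `load_language_data` and the example path are assumptions:

```python
# Rough sketch, not the PR's actual code: load plain JSON if present,
# otherwise fall back to the gzipped copy shipped in the distribution.
import gzip
import json
from pathlib import Path


def load_language_data(path):
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf8") as file_:
            return json.load(file_)
    gz_path = path.with_name(path.name + ".gz")  # e.g. _nouns.json -> _nouns.json.gz
    if gz_path.exists():
        with gzip.open(gz_path, "rt", encoding="utf8") as file_:
            return json.load(file_)
    raise FileNotFoundError(f"Can't find language data: {path}")
```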

Types of change

This is an enhancement.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

18 tests are failing at the moment, but the same 18 tests fail in master, so I assume that's unrelated.

polm and others added 30 commits August 18, 2019 14:17
Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.

This focuses on Turkish specifically because it has the most language
data in core.
This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.

None of the affected files have logic, so this was pretty
straightforward.

One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.
These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.

For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.
The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.
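
One way such a conversion can be done (sketched here with made-up file names; the PR may have done it differently) is to parse the single-quoted Python literal with ast.literal_eval and re-serialize it as JSON:

```python
# Hypothetical conversion script: parse the Python-style literal (single
# quotes are fine for ast.literal_eval) and write it back out as valid JSON.
import ast
import json

with open("lookup_source.txt", encoding="utf8") as file_:
    data = ast.literal_eval(file_.read())

with open("lookup.json", "w", encoding="utf8") as file_:
    json.dump(data, file_, ensure_ascii=False, indent=1, sort_keys=True)
```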
This covers the json.gz files built as part of distribution.
Currently this gzips the data on every build; it works, but it should be
changed to only gzip when the source file has been updated.
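
A minimal sketch of that kind of build step, with the suggested mtime check so files are only recompressed when the JSON source is newer; the directory layout and function name are assumptions, not the actual setup code:

```python
# Illustrative build helper: gzip each JSON file under spacy/lang, skipping
# files whose .json.gz copy is already newer than the source.
import gzip
import shutil
from pathlib import Path


def compress_language_data(lang_dir="spacy/lang"):
    for json_path in Path(lang_dir).glob("**/*.json"):
        gz_path = json_path.with_name(json_path.name + ".gz")
        if gz_path.exists() and gz_path.stat().st_mtime >= json_path.stat().st_mtime:
            continue  # compressed copy is up to date
        with json_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
```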
* Make doc.is_sentenced return True if len(doc) < 2.

* Make doc.is_nered return True if len(doc) == 0, for consistency.

Closes explosion#3934
* more friendly textcat errors with require_model and require_labels

* update thinc version with recent bugfix
…erators (explosion#3949)

* Add regression test for issue explosion#3541

* Add comment on bugfix

* Remove incorrect test

* Un-xfail test
* minimal failing example for Issue explosion#3661

* referenced Issue explosion#3661 instead of Issue explosion#3611

* cleanup
adrianeboyd and others added 26 commits August 18, 2019 14:21
* Add validate option to EntityRuler

* Add validate to EntityRuler, passed to Matcher and PhraseMatcher

* Add validate to usage and API docs

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <[email protected]>

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <[email protected]>
…n#4097)

* fixing vector and lemma attributes after retokenizer.split

* fixing unit test with mockup tensor

* xp instead of numpy
* Add entry for Blackstone in universe.json

Add an entry for the Blackstone project. Checked JSON is valid.

* Create ICLRandD.md

* Fix indentation (tabs to spaces)

It looks like during validation, the JSON file automatically changed spaces to tabs. This caused the diff to show *everything* as changed, which is obviously not true. This hopefully fixes that.

* Try to fix formatting for diff

* Fix diff


Co-authored-by: Ines Montani <[email protected]>
* update lang/zh

* update lang/zh
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bools instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustments to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* turn kb_creator into CLI script (wip)

* proper parameters for training entity vectors

* wikidata pipeline split up into two executable scripts

* remove context_width

* move wikidata scripts in bin directory, remove old dummy script

* refine KB script with logs and preprocessing options

* small edits

* small improvements to logging of EL CLI script
…plosion#4110)

* pytest file for issue4104 established

* edited default lookup english lemmatizer for spun; fixes issue 4102

* eliminated parameterization and sorted dictionary dependency in issue 4104 test

* added contributor agreement
Missed this when I added the json.
The way gzipped json is loaded/saved in srsly changed a bit.
If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.
This only affected files >500kb, which meant the noun data for both languages and
the generic lookup table for English.
It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.

This commit removes the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.
" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.
The en tokenizer was including the removed _nouns.py file, so that's
removed.

The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.
@polm
Contributor Author

polm commented Aug 18, 2019

Sorry, I screwed up the git history on this one. I'll re-create my branch and open a new PR shortly.

@polm polm closed this Aug 18, 2019
@polm polm mentioned this pull request Aug 18, 2019