Reduce size of language data #4140
Closed
Conversation
Rather than a large dict in Python source, the data is now a big json file. This includes a method for loading the json file, falling back to a compressed file, and an update to MANIFEST.in that excludes json in the spacy/lang directory. This focuses on Turkish specifically because it has the most language data in core.
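The loading scheme described above can be sketched roughly as follows. This is an illustrative reconstruction, not spaCy's actual internals: the function name and error handling are assumptions, but the idea matches the description (prefer the plain JSON file, fall back to a gzipped copy).

```python
import gzip
import json
from pathlib import Path


def load_language_data(path):
    """Load language data from a JSON file, falling back to a gzipped
    copy (same path plus ".gz") when the plain file is absent.
    Illustrative sketch; not spaCy's real loader."""
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf8") as f:
            return json.load(f)
    gz_path = Path(str(path) + ".gz")
    if gz_path.exists():
        # gzip.open in text mode yields a file-like object json can read
        with gzip.open(gz_path, "rt", encoding="utf8") as f:
            return json.load(f)
    raise FileNotFoundError(f"Could not find {path} or {gz_path}")
```

In a source checkout the plain JSON exists and is used directly; in an installed distribution only the .gz file is present, so the fallback branch runs.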
This covers all lemmatizer.py files of a significant size (>500k or so). Small files were left alone. None of the affected files have logic, so this was pretty straightforward. One unusual thing is that the lemma data for Urdu doesn't seem to be used anywhere. That may require further investigation.
These are the languages that use a lemmatizer directory (rather than a single file) and are larger than English. For most of these languages there were many language data files, in which case only the large ones (>500k or so) were converted to json. It may or may not be a good idea to migrate the remaining Python files to json in the future.
The contents of this file were originally just copied from the Python source, but that used single quotes, so it had to be properly converted to json first.
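A conversion like the one described, from a single-quoted Python literal to valid JSON, can be done by parsing the text as a Python literal and re-serializing it. This is a sketch of the kind of one-off conversion involved, not the exact script used:

```python
import ast
import json


def python_literal_to_json(src_text):
    """Parse a Python literal (e.g. a dict using single quotes) and
    re-serialize it as valid JSON. ast.literal_eval safely evaluates
    literals without executing arbitrary code."""
    data = ast.literal_eval(src_text)
    return json.dumps(data, ensure_ascii=False, indent=2)
```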
This covers the json.gz files built as part of distribution.
Currently this gzips the data on every build; it works, but it should be changed to only gzip when the source file has been updated.
* Make doc.is_sentenced return True if len(doc) < 2.
* Make doc.is_nered return True if len(doc) == 0, for consistency.

Closes explosion#3934
* more friendly textcat errors with require_model and require_labels
* update thinc version with recent bugfix
…erators (explosion#3949)
* Add regression test for issue explosion#3541
* Add comment on bugfix
* Remove incorrect test
* Un-xfail test
* minimal failing example for Issue explosion#3661
* referenced Issue explosion#3661 instead of Issue explosion#3611
* cleanup
* Add validate option to EntityRuler
* Add validate to EntityRuler, passed to Matcher and PhraseMatcher
* Add validate to usage and API docs
* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <[email protected]>

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <[email protected]>
…n#4097)
* fixing vector and lemma attributes after retokenizer.split
* fixing unit test with mockup tensor
* xp instead of numpy
* Add entry for Blackstone in universe.json

Add an entry for the Blackstone project. Checked JSON is valid.

* Create ICLRandD.md
* Fix indentation (tabs to spaces)

It looks like during validation, the JSON file automatically changed spaces to tabs. This caused the diff to show *everything* as changed, which is obviously not true. This hopefully fixes that.

* Try to fix formatting for diff
* Fix diff

Co-authored-by: Ines Montani <[email protected]>
* update lang/zh
* update lang/zh
* document token ent_kb_id
* document span kb_id
* update pipeline documentation
* prior and context weights as bool's instead
* entitylinker api documentation
* drop for both models
* finish entitylinker documentation
* small fixes
* documentation for KB
* candidate documentation
* links to api pages in code
* small fix
* frequency examples as counts for consistency
* consistent documentation about tensors returned by predict
* add entity linking to usage 101
* add entity linking infobox and KB section to 101
* entity-linking in linguistic features
* small typo corrections
* training example and docs for entity_linker
* predefined nlp and kb
* revert back to similarity encodings for simplicity (for now)
* set prior probabilities to 0 when excluded
* code clean up
* bugfix: deleting kb ID from tokens when entities were removed
* refactor train el example to use either model or vocab
* pretrain_kb example for example kb generation
* add to training docs for KB + EL example scripts
* small fixes
* error numbering
* ensure the language of vocab and nlp stay consistent across serialization
* equality with =
* avoid conflict in errors file
* add error 151
* final adjustments to the train scripts - consistency
* update of goldparse documentation
* small corrections
* push commit
* turn kb_creator into CLI script (wip)
* proper parameters for training entity vectors
* wikidata pipeline split up into two executable scripts
* remove context_width
* move wikidata scripts in bin directory, remove old dummy script
* refine KB script with logs and preprocessing options
* small edits
* small improvements to logging of EL CLI script
…plosion#4110)
* pytest file for issue4104 established
* edited default lookup english lemmatizer for spun; fixes issue 4102
* eliminated parameterization and sorted dictionary dependency in issue 4104 test
* added contributor agreement
Missed this when I added the json.
The way gzipped json is loaded/saved in srsly changed a bit.
If a .json.gz file exists and is newer than the corresponding json file, it's not recompressed.
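The freshness check can be sketched by comparing modification times, along these lines (a minimal illustration, assuming paths of the form data.json / data.json.gz; not the actual build code):

```python
import gzip
import shutil
from pathlib import Path


def gzip_if_stale(json_path):
    """Gzip json_path to json_path + ".gz", but only when the .gz
    file is missing or older than the source. Returns True if a
    compression was performed. Illustrative sketch only."""
    json_path = Path(json_path)
    gz_path = Path(str(json_path) + ".gz")
    if gz_path.exists() and gz_path.stat().st_mtime >= json_path.stat().st_mtime:
        return False  # .gz is up to date, skip recompression
    with json_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return True
```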
This only affected files over ~500kb, which meant the noun data for both languages and the generic lookup table for English.
It's unclear why, but the Norwegian (nb) tokenizer had empty files for adj/adv/noun/verb lemmas. This may have been a result of copying the structure of the English lemmatizer. This change removes the files but still creates the empty sets in the lemmatizer, which may not actually be necessary.
" furthest" and " skilled" - both prefixed with a space - were in the English lookup table. That seems obviously wrong so I have removed them.
The en tokenizer was still importing the removed _nouns.py file, so that import has been removed. The fr tokenizer is unusual in that it has a lemmatizer directory with both __init__.py and lemmatizer.py. lemmatizer.py had not been converted to load the json language data, so that was fixed.
Sorry, I screwed up the git history on this one. I'll re-create my branch and open a new PR shortly. |
Description
This PR reduces the size of a spaCy installation from ~280MB to ~74MB by packaging language data into gzipped JSON rather than Python source files, addressing concerns collected in #3258. Following this change, language data makes up slightly less than half of the total size of a spaCy installation. Note that this depends on explosion/srsly#9, and requirements.txt will need to be updated after that is merged and released.

Lemmatizer classes have basically two structures: a single lemmatizer.py, or a directory where files divided by part of speech and other considerations are unified in an __init__.py. This change deals with both kinds of lemmatizers, but only converts files that were individually over ~500kb. Some of these files were dictionaries and some were lists. In languages that use a directory for lemmatizer data, it might be a good idea to put all language data in a single JSON file in the future.

Note that because the data has just been moved from Python to JSON, the spaCy source is not actually smaller. As part of a dist build, JSON files are gzipped and only the gzipped files are included for distribution. This also means the actual wheel files or .tar.gz packages haven't changed in size (since they're already zipped).
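The packaging rules for this might look roughly like the following MANIFEST.in fragment. The exact patterns are assumptions based on the description above (exclude raw JSON under spacy/lang, ship only the gzipped copies); see the PR diff for the real rules:

```
# Assumed sketch: exclude raw JSON language data from the sdist,
# include only the gzipped copies built during the dist step.
recursive-exclude spacy/lang *.json
recursive-include spacy/lang *.json.gz
```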
Gzipped msgpack was also considered but resulted in larger files than JSON.
Types of change
This is an enhancement.
Checklist
18 tests are failing at the moment, but the same 18 tests fail in master, so I assume that's unrelated.