Reduce size of language data #4140

Closed
wants to merge 153 commits into from

Conversation

polm
Contributor

@polm polm commented Aug 18, 2019

Description

This PR reduces the size of a spaCy installation from ~280MB to ~74MB by packaging language data as gzipped JSON rather than Python source files, addressing the concerns collected in #3258. Following this change, language data accounts for slightly less than half of the total size of a spaCy installation. Note that this depends on explosion/srsly#9, and requirements.txt will need to be updated after that is merged and released.

Lemmatizer classes have basically two structures: either a single lemmatizer.py, or a directory where files divided by part of speech and other considerations are unified in an __init__.py. This change deals with both kinds of lemmatizers, but only converts files that were individually over ~500kB. Some of these files were dictionaries and some were lists. For languages that use a directory for lemmatizer data, it might be a good idea to put all language data in a single JSON file in the future.

Note that because the data has only been moved from Python to JSON, the spaCy source itself is not actually smaller. As part of a dist build, the JSON files are gzipped and only the gzipped files are included in the distribution. This also means the actual wheel files or .tar.gz packages haven't changed in size (since they're already zipped).

Gzipped msgpack was also considered but resulted in larger files than JSON.
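
As a rough illustration of the loading side (not the PR's actual implementation, which relies on the gzipped-JSON support added in explosion/srsly#9), a loader with this kind of fallback could look like the sketch below; the helper name `load_language_data` and the example path are assumptions:

```python
# Rough sketch, not the PR's actual code: load plain JSON if present,
# otherwise fall back to the gzipped copy shipped in the distribution.
import gzip
import json
from pathlib import Path


def load_language_data(path):
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf8") as file_:
            return json.load(file_)
    gz_path = path.with_name(path.name + ".gz")  # e.g. _nouns.json -> _nouns.json.gz
    if gz_path.exists():
        with gzip.open(gz_path, "rt", encoding="utf8") as file_:
            return json.load(file_)
    raise FileNotFoundError(f"Can't find language data: {path}")
```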

Types of change

This is an enhancement.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

18 tests are failing at the moment, but the same 18 tests fail in master, so I assume that's unrelated.

polm and others added 30 commits August 18, 2019 14:17
Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.

This focuses on Turkish specifically because it has the most language
data in core.
This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.

None of the affected files have logic, so this was pretty
straightforward.

One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.
These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.

For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.
The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.
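
One way such a conversion can be done (sketched here with made-up file names; the PR may have done it differently) is to parse the single-quoted Python literal with ast.literal_eval and re-serialize it as JSON:

```python
# Hypothetical conversion script: parse the Python-style literal (single
# quotes are fine for ast.literal_eval) and write it back out as valid JSON.
import ast
import json

with open("lookup_source.txt", encoding="utf8") as file_:
    data = ast.literal_eval(file_.read())

with open("lookup.json", "w", encoding="utf8") as file_:
    json.dump(data, file_, ensure_ascii=False, indent=1, sort_keys=True)
```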
This covers the json.gz files built as part of distribution.
Currently this gzips the data on every build; it works, but it should be
changed to only gzip when the source file has been updated.
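
A minimal sketch of that kind of build step, with the suggested mtime check so files are only recompressed when the JSON source is newer; the directory layout and function name are assumptions, not the actual setup code:

```python
# Illustrative build helper: gzip each JSON file under spacy/lang, skipping
# files whose .json.gz copy is already newer than the source.
import gzip
import shutil
from pathlib import Path


def compress_language_data(lang_dir="spacy/lang"):
    for json_path in Path(lang_dir).glob("**/*.json"):
        gz_path = json_path.with_name(json_path.name + ".gz")
        if gz_path.exists() and gz_path.stat().st_mtime >= json_path.stat().st_mtime:
            continue  # compressed copy is up to date
        with json_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
```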
* Make doc.is_sentenced return True if len(doc) < 2.

* Make doc.is_nered return True if len(doc) == 0, for consistency.

Closes explosion#3934
* more friendly textcat errors with require_model and require_labels

* update thinc version with recent bugfix
…erators (explosion#3949)

* Add regression test for issue explosion#3541

* Add comment on bugfix

* Remove incorrect test

* Un-xfail test
* minimal failing example for Issue explosion#3661

* referenced Issue explosion#3661 instead of Issue explosion#3611

* cleanup
adrianeboyd and others added 26 commits August 18, 2019 14:21
* Add validate option to EntityRuler

* Add validate to EntityRuler, passed to Matcher and PhraseMatcher

* Add validate to usage and API docs

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <[email protected]>

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <[email protected]>
…n#4097)

* fixing vector and lemma attributes after retokenizer.split

* fixing unit test with mockup tensor

* xp instead of numpy
* Add entry for Blackstone in universe.json

Add an entry for the Blackstone project. Checked JSON is valid.

* Create ICLRandD.md

* Fix indentation (tabs to spaces)

It looks like during validation, the JSON file automatically changed spaces to tabs. This caused the diff to show *everything* as changed, which is obviously not true. This hopefully fixes that.

* Try to fix formatting for diff

* Fix diff


Co-authored-by: Ines Montani <[email protected]>
* update lang/zh

* update lang/zh
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bools instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustments to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* turn kb_creator into CLI script (wip)

* proper parameters for training entity vectors

* wikidata pipeline split up into two executable scripts

* remove context_width

* move wikidata scripts in bin directory, remove old dummy script

* refine KB script with logs and preprocessing options

* small edits

* small improvements to logging of EL CLI script
…plosion#4110)

* pytest file for issue4104 established

* edited default lookup english lemmatizer for spun; fixes issue 4102

* eliminated parameterization and sorted dictionary dependency in issue 4104 test

* added contributor agreement
Missed this when I added the json.
The way gzipped json is loaded/saved in srsly changed a bit.
If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.
This only affected files >500kb, which meant the noun data for both languages and
the generic lookup table for English.
It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.

This commit removes the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.
" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.
The en tokenizer was including the removed _nouns.py file, so that's
removed.

The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.
@polm
Contributor Author

polm commented Aug 18, 2019

Sorry, I screwed up the git history on this one. I'll re-create my branch and open a new PR shortly.

@polm polm closed this Aug 18, 2019
@polm polm mentioned this pull request Aug 18, 2019