Release v2.2.2: Multiprocessing, future APIs, Luxembourgish base support & simpler GPU install · explosion/spaCy

✨ New features and improvements

NEW: Support multiprocessing in nlp.pipe via the n_process argument (Python 3 only).
Base language support for Luxembourgish.
Add noun chunks iterator for Swedish.
Retrained models for Greek, Norwegian Bokmål and Lithuanian that now correctly support parser-based sentence segmentation.
Repackaged models for Greek and German with improved lookup tables via spacy-lookups-data.
Add warning in debug-data for a low sentences-per-doc ratio.
Improve checks and errors related to ill-formed IOB input in convert and debug-data CLI.
Support training dict format as JSONL.
Make EntityRuler ID resolution 2× faster and support "id" in patterns to set Token.ent_id.
Improve rendering of named entity spans in displacy for RTL languages.
Update Thinc to ditch thinc_gpu_ops for simpler GPU install.
Support Mish activation in spacy pretrain.
Add forwards-compatible support for new Language.disable_pipes API, which will become the default in the future. The method can now also take a list of component names as its first argument (instead of a variable number of arguments).
```
- disabled = nlp.disable_pipes("tagger", "parser")
+ disabled = nlp.disable_pipes(["tagger", "parser"])
```
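To illustrate how one method can accept both calling conventions during the transition, here is a minimal sketch in plain Python. The helper name `normalize_pipe_names` is hypothetical and this is not presented as spaCy's actual implementation; it only shows the idea of treating a single list argument the same way as a variable number of names.

```python
def normalize_pipe_names(*names):
    """Return component names as a list, whichever call style was used.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    # New style: a single list (or tuple) passed as the first argument.
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        return list(names[0])
    # Old style: names passed as separate positional arguments.
    return list(names)


old_style = normalize_pipe_names("tagger", "parser")
new_style = normalize_pipe_names(["tagger", "parser"])
print(old_style == new_style)
```

Under this assumption, code written against either style resolves to the same list of component names, which is why existing calls keep working while new code can already adopt the list form.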
Add forwards-compatible support for new Matcher.add and PhraseMatcher.add API, which will become the default in the future. The patterns are now the second argument and a list (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
```
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add("GoogleNow", None, *patterns)
+ matcher.add("GoogleNow", patterns)
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)
```
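The equivalence of the old and new call styles can be sketched with a small stand-in class. `ToyMatcher` below is hypothetical and only mimics the argument handling described above; it does no matching and is not spaCy's Matcher. The sketch assumes the old style can be recognized because its second argument is either `None` or a callable, never a list of patterns.

```python
class ToyMatcher:
    """Hypothetical stand-in for illustration; not spaCy's Matcher."""

    def __init__(self):
        self.patterns = {}   # key -> list of patterns
        self.callbacks = {}  # key -> on_match callback (or None)

    def add(self, key, patterns, *extra_patterns, on_match=None):
        # Old style: add(key, on_match_or_None, *patterns).
        # The second argument is then None or a callable, never a list.
        if patterns is None or callable(patterns):
            on_match = patterns
            patterns = list(extra_patterns)
        # New style: patterns is already a list, on_match is a keyword.
        self.patterns.setdefault(key, []).extend(patterns)
        self.callbacks[key] = on_match


def on_match(matcher, doc, i, matches):
    pass  # placeholder callback for the example


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]

old = ToyMatcher()
old.add("GoogleNow", on_match, *patterns)           # old call style

new = ToyMatcher()
new.add("GoogleNow", patterns, on_match=on_match)   # new call style

print(old.patterns == new.patterns and old.callbacks == new.callbacks)
```

Both calls register the same two patterns and the same callback, so migrating is a matter of moving the patterns into a list and passing the callback by keyword.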
Add new and improved tokenization alignment in gold.align behind a feature flag. The new alignment may produce backwards-incompatible results, so it won't be enabled by default before v3.0.
```
import spacy.gold
spacy.gold.USE_NEW_ALIGN = True
```
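For the EntityRuler change listed above, a pattern entry carrying the new "id" key might look like the following. This is a sketch using only the standard library: the labels, pattern text and ID values are made-up examples, and the JSONL round trip is shown only to illustrate one pattern per line, not as spaCy's own serialization code.

```python
import json

# Example EntityRuler-style patterns. The optional "id" key is the addition
# described in the notes above; the values here are illustrative only.
patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {
        "label": "GPE",
        "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}],
        "id": "san-francisco",
    },
]

# With spaCy installed, a list like this would typically be handed to an
# EntityRuler via ruler.add_patterns(patterns).

# Write one JSON object per line (JSONL) and read it back.
jsonl = "\n".join(json.dumps(p) for p in patterns)
restored = [json.loads(line) for line in jsonl.splitlines()]
print(restored == patterns)
```

Because the ID travels with each pattern, several surface forms can share one "id" value and be resolved to the same entity ID.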

🔴 Bug fixes

Fix issue #1303: Support multiprocessing in nlp.pipe.
Fix issue #1745: Ditch thinc_gpu_ops for simpler GPU install.
Fix issue #2411: Update Thinc to fix compilation on cygwin.
Fix issue #3412: Prevent division by zero in Vectors.most_similar.
Fix issue #3618: Fix memory leak for long-running parsing processes.
Fix issue #4241: Update Greek lookups in spacy-lookups-data.
Fix issue #4269: Extend unicode character block for Sinhala.
Fix issue #4362: Improve URL_PATTERN and handling in tokenizer.
Fix issue #4373: Make PhraseMatcher.vocab consistent with Matcher.vocab.
Fix issue #4377: Clarify serialization of extension attributes.
Fix issue #4382: Improve usage of pkg_resources and handling of entry points.
Fix issue #4386: Consider batch_size when sorting similar vectors.
Fix issue #4389: Fix ner_jsonl2json converter.
Fix issue #4397: Ensure on_match callback is executed in PhraseMatcher.
Fix issues #4401, #4408: Fix sentence segmentation in Greek, Norwegian and Lithuanian models.
Fix issue #4402: Fix issue with how training data was passed through the pipeline.
Fix issue #4406: Correct spelling in lemmatizer API docs.
Fix issues #4418, #4438: Improve knowledge base and Wikidata parsing.
Fix issue #4435: Fix PhraseMatcher.remove for overlapping patterns.
Fix issue #4443: Fix bug in Vectors.most_similar.
Fix issue #4452: Fix gold.docs_to_json documentation.
Fix issue #4463: Add missing cats to GoldParse.from_annot_tuples in Scorer.
Fix issue #4470: Suppress convert output if writing to stdout.
Fix issue #4475: Correct mistake in docs example.
Fix issue #4485: Update tag maps and docs for English and German.
Fix issue #4493: Update information in spaCy Universe.
Fix issue #4496: Improve docs of PhraseMatcher.add arguments.
Fix issue #4506: Ensure Vectors.most_similar returns 1.0 for identical vectors.
Fix issue #4509: Fix None iteration error in entity linking script.
Fix issue #4524: Fix typo in Parser sample construction of GoldParse.
Fix issue #4528: Fix serialization of extension attribute values in DocBin.
Fix issue #4529: Ensure GoldParse is initialized correctly with misaligned tokens.
Fix issue #4538: Backport memory leak fix to v2.1.x branch and release v2.1.9.

⚠️ Backwards incompatibilities

The unused attributes lemma_rules, lemma_index, lemma_exc and lemma_lookup of Language.Defaults have been removed to prevent confusion (e.g. if users add rules that then have no effect). Lemmatization tables are now stored in, and can be modified at runtime only via, nlp.vocab.lookups.
```
- nlp.Defaults.lemma_lookup["spaCies"] = "spaCy"
+ lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
+ lemma_lookup["spaCies"] = "spaCy"
```

📖 Documentation and examples

Fix various typos and inconsistencies.
Add more projects to the spaCy Universe.

👥 Contributors

Thanks to @tamuhey, @PeterGilles, @akornilo, @danielkingai2, @ghollah, @pberba, @gustavengstrom, @ju-sh, @kabirkhan, @ZhuoruLin, @nipunsadvilkar and @neelkamath for the pull requests and contributions.
