v2.2.2: Multiprocessing, future APIs, Luxembourgish base support & simpler GPU install
✨ New features and improvements
- NEW: Support multiprocessing in nlp.pipe via the n_process argument (Python 3 only).
- Base language support for Luxembourgish.
- Add noun chunks iterator for Swedish.
- Retrained models for Greek, Norwegian Bokmål and Lithuanian that now correctly support parser-based sentence segmentation.
- Repackaged models for Greek and German with improved lookup tables via spacy-lookups-data.
- Add warning in debug-data for low sentences per doc ratio.
- Improve checks and errors related to ill-formed IOB input in convert and debug-data CLI.
- Support training dict format as JSONL.
- Make EntityRuler ID resolution 2× faster and support "id" in patterns to set Token.ent_id.
- Improve rendering of named entity spans in displacy for RTL languages.
- Update Thinc to ditch thinc_gpu_ops for simpler GPU install.
- Support Mish activation in spacy pretrain.
- Add forwards-compatible support for new Language.disable_pipes API, which will become the default in the future. The method can now also take a list of component names as its first argument (instead of a variable number of arguments).
    - disabled = nlp.disable_pipes("tagger", "parser")
    + disabled = nlp.disable_pipes(["tagger", "parser"])
- Add forwards-compatible support for new Matcher.add and PhraseMatcher.add API, which will become the default in the future. The patterns are now the second argument and a list (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
      patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
    - matcher.add("GoogleNow", None, *patterns)
    + matcher.add("GoogleNow", patterns)
    - matcher.add("GoogleNow", on_match, *patterns)
    + matcher.add("GoogleNow", patterns, on_match=on_match)
- Add new and improved tokenization alignment in gold.align behind a feature flag. The new alignment may produce backwards-incompatible results, so it won't be enabled by default before v3.0.
    import spacy.gold
    spacy.gold.USE_NEW_ALIGN = True
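As a rough illustration of the new Matcher.add and PhraseMatcher.add call style described above, here is a minimal sketch. It assumes a blank English pipeline (tokenizer only, no model download); the sample sentence, the "ASSISTANT" label and the matched_spans list are made up for the example and are not part of the release.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# Assumption: a blank English pipeline is enough here, since the
# patterns only look at the token text.
nlp = spacy.blank("en")

patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
matched_spans = []  # hypothetical helper list, filled by the callback


def on_match(matcher, doc, i, matches):
    # Called once per match; record the text of the matched span.
    match_id, start, end = matches[i]
    matched_spans.append(doc[start:end].text)


# New style: patterns as a list in second position, callback as keyword.
matcher = Matcher(nlp.vocab)
matcher.add("GoogleNow", patterns, on_match=on_match)

# PhraseMatcher follows the same shape, with Doc objects as patterns.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("ASSISTANT", [nlp.make_doc("Google Now")])

doc = nlp("I like Google Now and GoogleNow")
matches = matcher(doc)
phrase_matches = phrase_matcher(doc)
```

Passing the patterns as one list keeps the call unambiguous when the callback is omitted, which is presumably why it is planned to become the default.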
🔴 Bug fixes
- Fix issue #1303: Support multiprocessing in nlp.pipe.
- Fix issue #1745: Ditch thinc_gpu_ops for simpler GPU install.
- Fix issue #2411: Update Thinc to fix compilation on cygwin.
- Fix issue #3412: Prevent division by zero in Vectors.most_similar.
- Fix issue #3618: Fix memory leak for long-running parsing processes.
- Fix issue #4241: Update Greek lookups in spacy-lookups-data.
- Fix issue #4269: Extend unicode character block for Sinhala.
- Fix issue #4362: Improve URL_PATTERN and handling in tokenizer.
- Fix issue #4373: Make PhraseMatcher.vocab consistent with Matcher.vocab.
- Fix issue #4377: Clarify serialization of extension attributes.
- Fix issue #4382: Improve usage of pkg_resources and handling of entry points.
- Fix issue #4386: Consider batch_size when sorting similar vectors.
- Fix issue #4389: Fix ner_jsonl2json converter.
- Fix issue #4397: Ensure on_match callback is executed in PhraseMatcher.
- Fix issue #4401, #4408: Fix sentence segmentation in Greek, Norwegian and Lithuanian models.
- Fix issue #4402: Fix issue with how training data was passed through the pipeline.
- Fix issue #4406: Correct spelling in lemmatizer API docs.
- Fix issue #4418, #4438: Improve knowledge base and Wikidata parsing.
- Fix issue #4435: Fix PhraseMatcher.remove for overlapping patterns.
- Fix issue #4443: Fix bug in Vectors.most_similar.
- Fix issue #4452: Fix gold.docs_to_json documentation.
- Fix issue #4463: Add missing cats to GoldParse.from_annot_tuples in Scorer.
- Fix issue #4470: Suppress convert output if writing to stdout.
- Fix issue #4475: Correct mistake in docs example.
- Fix issue #4485: Update tag maps and docs for English and German.
- Fix issue #4493: Update information in spaCy Universe.
- Fix issue #4496: Improve docs of PhraseMatcher.add arguments.
- Fix issue #4506: Ensure Vectors.most_similar returns 1.0 for identical vectors.
- Fix issue #4509: Fix None iteration error in entity linking script.
- Fix issue #4524: Fix typo in Parser sample construction of GoldParse.
- Fix issue #4528: Fix serialization of extension attribute values in DocBin.
- Fix issue #4529: Ensure GoldParse is initialized correctly with misaligned tokens.
- Fix issue #4538: Backport memory leak fix to v2.1.x branch and release v2.1.9.
⚠️ Backwards incompatibilities
- The unused attributes lemma_rules, lemma_index, lemma_exc and lemma_lookup of the Language.Defaults have now been removed to prevent confusion (e.g. if users add rules that then have no effect). The only place lemmatization tables are stored and can be modified at runtime is via nlp.vocab.lookups.
    - nlp.Defaults.lemma_lookup["spaCies"] = "spaCy"
    + lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
    + lemma_lookup["spaCies"] = "spaCy"
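The runtime modification above can be tried in a self-contained way with the sketch below. One assumption to note: a blank pipeline may not ship a lemma_lookup table (for example when spacy-lookups-data is not installed), so the sketch creates an empty table in that case. The fallback branch is illustrative and not part of the migration itself.

```python
import spacy

nlp = spacy.blank("en")
lookups = nlp.vocab.lookups

# Assumption: the table may be absent if no lookups data is installed,
# so fall back to an empty table to keep the example runnable.
if lookups.has_table("lemma_lookup"):
    lemma_lookup = lookups.get_table("lemma_lookup")
else:
    lemma_lookup = lookups.add_table("lemma_lookup")

# Modify the table at runtime via nlp.vocab.lookups.
lemma_lookup["spaCies"] = "spaCy"

# Reading the table back through the vocab sees the new entry.
result = nlp.vocab.lookups.get_table("lemma_lookup")["spaCies"]
```

Because the table lives on the vocab, the change applies to this nlp object only and is not shared through class-level defaults.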
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Add more projects to the spaCy Universe.
👥 Contributors
Thanks to @tamuhey, @PeterGilles, @akornilo, @danielkingai2, @ghollah, @pberba, @gustavengstrom, @ju-sh, @kabirkhan, @ZhuoruLin, @nipunsadvilkar and @neelkamath for the pull requests and contributions.