
Releases: explosion/spaCy

v2.3.0: Models for Chinese, Danish, Japanese, Polish and Romanian, new updated models with vectors, faster loading, small API improvements & lots of bug fixes

16 Jun 14:25
d5110ff

⚠️ This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

✨ New features and improvements

  • NEW: Pretrained model families for Chinese, Danish, Japanese, Polish and Romanian, as well as larger models with vectors for Dutch, German, French, Italian, Greek, Lithuanian, Portuguese and Spanish. 29 new models and 46 model packages in total!
  • NEW: 2-4× faster loading times for models with vectors and 2× smaller packages.
  • NEW: Alpha support for Armenian, Gujarati and Malayalam.
  • NEW: Lookup lemmatization for Polish.
  • NEW: Allow Matcher to match on both Doc and Span objects.
  • NEW: Add Token.is_sent_end property.
  • Improve language data for Danish, Dutch, French, German, Italian, Lithuanian, Norwegian, Romanian and Spanish to better match UD corpora.
  • Update language data for Danish, Kannada, Korean, Persian, Swedish and Urdu.
  • Add support for pkuseg alongside jieba for Chinese.
  • Switch from fugashi to sudachipy for Japanese.
  • Improve punctuation used in sentencizer.
  • Switch to new and more consistent alignment method in gold.align.
  • Reduce stored lexemes data and move non-derivable features to spacy-lookups-data.
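
To illustrate matching on both Doc and Span objects, here is a minimal sketch. It assumes a blank English pipeline (tokenizer only, no model download); the example text and the "HELLO" pattern name are made up for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # blank pipeline: tokenizer only
doc = nlp("hello world and hello again")
span = doc[2:5]  # the tokens "and hello again"

matcher = Matcher(nlp.vocab)
# List-of-patterns form of Matcher.add (supported since v2.2.2)
matcher.add("HELLO", [[{"LOWER": "hello"}]])

doc_matches = matcher(doc)    # searches the whole Doc
span_matches = matcher(span)  # searches only the tokens inside the Span

print(len(doc_matches), len(span_matches))
```

Passing the Span restricts matching to its tokens, so the first "hello" (outside the span) is not reported.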

🔴 Bug fixes

  • Fix issue #5056: Introduce support for matching Span objects.
  • Fix issue #5086: Remove Vectors.from_glove.
  • Fix issue #5131: Improve data processing in named entity linking scripts.
  • Fix issue #5137: Fix passing of component configuration to component.
  • Fix issue #5144: Fix sentence comparison in test util.
  • Fix issue #5166: Fix handling of exclusive_classes in textcat ensemble.
  • Fix issue #5170: Set rank for new vector in Vocab.set_vector.
  • Fix issue #5181: Prevent None values in gold fields.
  • Fix issue #5191: Fix GoldParse initialization when the number of tokens has changed.
  • Fix issue #5193: Correctly pin cupy-cuda extra dependencies.
  • Fix issue #5200: Fix minor bugs in train CLI.
  • Fix issue #5216: Modify Vectors.resize to work with cupy.
  • Fix issue #5228: Raise error for inplace resize with new vector dimension.
  • Fix issue #5230: Fix unittest warnings when saving a model.
  • Fix issue #5257: Use inline flags in token_match patterns.
  • Fix issue #5278, #5359: Add missing __init__.py files to language data tests.
  • Fix issue #5281: Fix comparison predicate handling for !=.
  • Fix issue #5287: Normalize TokenC.sent_start values for Matcher.
  • Fix issue #5292: Fix typo in option name --n-save_every.
  • Fix issue #5303: Use max(uint64) for OOV lexeme rank.
  • Fix issue #5311: Fix alignment of cards on landing page.
  • Fix issue #5320: Fix most_similar for vectors with unused rows.
  • Fix issue #5344: Prevent pip from installing spaCy on Python 3.4.
  • Fix issue #5356: Fix bug in Span.similarity that could trigger TypeError.
  • Fix issue #5361: Fix problems with lower and whitespace in variants.
  • Fix issue #5373: Improve exceptions for 'd (would/had) in English.
  • Fix issue #5387: Fix logic in train CLI timing eval on CPU/GPU.
  • Fix issue #5393, #5458: Fix check for overlapping spans in noun chunks.
  • Fix issue #5429: Modify array type to accommodate OOV_RANK.
  • Fix issue #5430: Check that row is within bounds when adding vector.
  • Fix issue #5435: Use Token.sent_start for Span.sent.
  • Fix issue #5436: Fix ErrorsWithCodes().__class__ return value.
  • Fix issue #5450: Disallow merging 0-length spans.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you're training new models, you'll want to install the package spacy-lookups-data, which now includes both the lemmatization tables (as in v2.2) and the normalization tables (new in v2.3). If you're using pretrained models, nothing changes, because the relevant tables are included in the model packages.
  • Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences.
  • For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tagsets contain new merged tags related to contracted forms, such as ADP_DET for French "au", which maps to UPOS ADP based on the head "à". This increases the accuracy of the models by improving the alignment between spaCy's tokenization and Universal Dependencies multi-word tokens used for contractions.
  • spaCy's custom warnings have been replaced with native Python warnings. Instead of setting SPACY_WARNING_IGNORE, use the warnings filters to manage warnings.
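
As a rough sketch of what managing warnings through Python's standard filters can look like: the function below is a stand-in for a library call, and the "[W999]" code and message text are hypothetical, chosen only to resemble a coded warning.

```python
import warnings


def noisy_component():
    # Stand-in for a library call; the "[W999]" code and text are invented.
    warnings.warn("[W999] Hypothetical warning from a pipeline component", UserWarning)
    return "done"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # baseline: report every warning
    # `message` is a regular expression matched against the start of the text
    warnings.filterwarnings("ignore", message=r"\[W999\]")
    result = noisy_component()

print(result, len(caught))
```

Filters added later take precedence over earlier ones, so the targeted "ignore" rule wins over the blanket "always" rule and nothing is recorded. Outside of tests, the same `filterwarnings` call can be made once at program start.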

📖 Documentation and examples

  • Fix various typos and inconsistencies.
  • Add new projects to the spaCy Universe.
  • Move bin/wiki_entity_linking scripts for Wikipedia to projects repo.

🔥 ICYMI: We recently updated the free and interactive spaCy course to include translations for German (with German NLP examples), Spanish (with Spanish NLP examples) and Japanese, as well as videos for English and German. Translations for Chinese (with Chinese NLP examples), French (with French NLP examples) and Russian coming soon!

📦 Model packages (46)

Model            Language          Version  Vectors
zh_core_web_sm   Chinese           2.3.0    𐄂
zh_core_web_md   Chinese           2.3.0    ✓
zh_core_web_lg   Chinese           2.3.0    ✓
da_core_news_sm  Danish            2.3.0    𐄂
da_core_news_md  Danish            2.3.0    ✓
da_core_news_lg  Danish            2.3.0    ✓
nl_core_news_sm  Dutch             2.3.0    𐄂
nl_core_news_md  Dutch             2.3.0    ✓
nl_core_news_lg  Dutch             2.3.0    ✓
en_core_web_sm   English           2.3.0    𐄂
en_core_web_md   English           2.3.0    ✓
en_core_web_lg   English           2.3.0    ✓
fr_core_news_sm  French            2.3.0    𐄂
fr_core_news_md  French            2.3.0    ✓
fr_core_news_lg  French            2.3.0    ✓
de_core_news_sm  German            2.3.0    𐄂
de_core_news_md  German            2.3.0    ✓
de_core_news_lg  German            2.3.0    ✓
el_core_news_sm  Greek             2.3.0    𐄂
el_core_news_md  Greek             2.3.0    ✓
el_core_news_lg  Greek             2.3.0    ✓
it_core_news_sm  Italian           2.3.0    𐄂
it_core_news_md  Italian           2.3.0    ✓
it_core_news_lg  Italian           2.3.0    ✓
ja_core_news_sm  Japanese          2.3.0    𐄂
ja_core_news_md  Japanese          2.3.0    ✓
ja_core_news_lg  Japanese          2.3.0    ✓
lt_core_news_sm  Lithuanian        2.3.0    𐄂
lt_core_news_md  Lithuanian        2.3.0    ✓
lt_core_news_lg  Lithuanian        2.3.0    ✓
nb_core_news_sm  Norwegian Bokmål  2.3.0    𐄂
nb_core_news_md  Norwegian Bokmål  2.3.0    ✓
nb_core_news_lg  Norwegian Bokmål  2.3.0    ✓
pl_core_news_sm  Polish            2.3.0    𐄂
pl_core_news_md  Polish            2.3.0    ✓
pl_core_news_lg  Polish            2.3.0    ✓
pt_core_news_sm  Portuguese        2.3.0    𐄂
pt_core_news_md  Portuguese        2.3.0    ✓
pt_core_news_lg  Portuguese        2.3.0    ✓
ro_core_news_sm  Romanian          2.3.0    𐄂
ro_core_news_md  Romanian          2.3.0    ✓
ro_core_news_lg  Romanian          2.3.0    ✓
es_core_news_sm  Spanish           2.3.0    𐄂
es_core_news_md  Spanish           2.3.0    ✓
es_core_news_lg  Spanish           2.3.0    ✓
xx_ent_wiki_sm   Multi-language    2.3.0    𐄂

v2.2.4: Alpha support for Yoruba and Basque, language data improvements and lots of bug fixes

12 Mar 13:40

✨ New features and improvements

  • NEW: Add Span.char_span method.
  • NEW: Base language support for Yoruba and Basque.
  • NEW: Add --tag-map-path argument to debug-data and train commands.
  • NEW: Add add_lemma option to displacy dependency visualizer.
  • Add IDX as an attribute available via Doc.to_array.
  • Improve speed of adding large number of patterns to EntityRuler.
  • Replace python-mecab3 with fugashi for Japanese.
  • Improve language data for Norwegian, Luxembourgish, Finnish, Slovak, Romanian, Greek and German.
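
A small sketch of the new Span.char_span method, assuming a blank English pipeline and an invented example sentence. The character offsets are taken relative to the start of the span, not the start of the document.

```python
from spacy.lang.en import English

nlp = English()  # blank pipeline: tokenizer only
doc = nlp("I like New York in autumn")
span = doc[2:6]  # "New York in autumn"

# Offsets 0..8 are relative to the span's own text
city = span.char_span(0, 8)
print(city.text)
```

With the default strict alignment, offsets that do not line up with token boundaries yield None rather than a Span, so the result is worth checking before use.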

🔴 Bug fixes

  • Fix issue #3979, #4819, #4871: Add tok2vec parameters to train command.
  • Fix issue #4009: Fix use of pretrained vectors in text classifier.
  • Fix issue #4342: Improve CLI training with base model.
  • Fix issue #4432: Add destructors for states in TransitionSystem.
  • Fix issue #4440: Require HEAD for is_parsed in Doc.from_array.
  • Fix issue #4615: Update SHAPE docs and examples.
  • Fix issue #4665: Allow HEAD field in CoNLL-U format to be an underscore.
  • Fix issue #4673: Ensure correct array module is used when returning a vector via Vocab.
  • Fix issue #4674: Make set_entities in the KnowledgeBase more robust.
  • Fix issue #4677: Add missing tags to tag maps for el, es and pt.
  • Fix issue #4688: Iterate over lr_edges until Doc.sents are correct.
  • Fix issue #4703, #4823: Facilitate large training files.
  • Fix issue #4707: Auto-exclude disabled when calling from_disk during load.
  • Fix issue #4717: Fix int value handling in Matcher.
  • Fix issue #4719: Add message when cli train script throws exception.
  • Fix issue #4723: Update EntityLinker example.
  • Fix issue #4725: Take care of global vectors in multiprocessing.
  • Fix issue #4770: Include Doc.cats in serialization of Doc and DocBin.
  • Fix issue #4772: Fix bug in EntityLinker.predict.
  • Fix issue #4777: Fix link to user hooks in documentation.
  • Fix issue #4829: Update build dependencies in pyproject.toml.
  • Fix issue #4830: Warn for punctuation in entities when training with noise.
  • Fix issue #4833: Make example scripts work with transformer starter models.
  • Fix issue #4849: Fix serialization of ENT_ID.
  • Fix issue #4862: Fix and improve URL pattern.
  • Fix issue #4868: Include .pyx and .pxd files in the distribution.
  • Fix issue #4876: Add friendlier error to entity linking example script.
  • Fix issue #4903: Fix handling of custom underscore attributes during multiprocessing.
  • Fix issue #4924: Fix handling of empty docs or golds in Language.evaluate.
  • Fix issue #4934: Prevent updating component config if the Model was already defined.
  • Fix issue #4935: Fix Sentencizer.pipe for empty Doc.
  • Fix issue #4961: Remove old docs section links.
  • Fix issue #4965: Sync Span.__eq__ and Span.__hash__.
  • Fix issue #4975: Adjust srsly pin.
  • Fix issue #5048: Fix behavior of get_doc test utility.
  • Fix issue #5073: Normalize IS_SENT_START to SENT_START for Matcher.
  • Fix issue #5075: Make it impossible to create invalid heads with Doc.from_array.
  • Fix issue #5082: Correctly set vector of merged span in merge_entities.
  • Fix issue #5115: Ensure paths in Tokenizer.to_disk and Tokenizer.from_disk.
  • Fix issue #5117: Clarify behavior of Doc.is_ flags for empty Docs.

📖 Documentation and examples

  • Fix various typos and inconsistencies.
  • Add new projects to the spaCy Universe.

👥 Contributors

Thanks to @polm, @mmaybeno, @jarib, @questoph, @aajanki, @mr-bjerre, @Tclack88, @thiagola92, @tamuhey, @Olamyy, @AlJohri, @iechevarria, @iurshina, @lineality, @pbadeer, @BramVanroy, @kabirkhan, @ceteri, @omri374, @maknotavailable, @onlyanegg, @drndos, @ju-sh, @nlptechbook, @chkoar, @Jan-711, @MisterKeefe, @bryant1410, @mirfan899, @dhpollack and @mabraham for the pull requests and contributions!

v2.2.3: Tokenizer.explain, Korean base support, dependency scores per label and bug fixes

21 Nov 18:21

✨ New features and improvements

  • NEW: Tokenizer.explain method to see which rule or pattern was matched.
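    from spacy.lang.en import English
    nlp = English()  # blank English pipeline; only the tokenizer is needed here
    tok_exp = nlp.tokenizer.explain("(don't)")
    assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
    assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]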
  • NEW: Official Python 3.8 wheels for spaCy and its dependencies.
  • Base language support for Korean.
  • Add Scorer.las_per_type (labelled dependency scores per label).
  • Rework Chinese language initialization and tokenization.
  • Improve language data for Luxembourgish.

🔴 Bug fixes

  • Fix issue #4573, #4645: Improve tokenizer usage docs.
  • Fix issue #4575: Add error in debug-data if no dev docs are available.
  • Fix issue #4582: Make as_tuples=True in Language.pipe work with multiprocessing.
  • Fix issue #4590: Correctly call on_match in DependencyMatcher.
  • Fix issue #4593: Build wheels for Python 3.8.
  • Fix issue #4604: Fix realloc in Retokenizer.split.
  • Fix issue #4656: Fix conllu2json converter when -n > 1.
  • Fix issue #4662: Fix Language.evaluate for components without .pipe method.
  • Fix issue #4670: Ensure EntityRuler is deserialized correctly from disk.
  • Fix issue #4680: Raise error if non-string labels are added to Tagger or TextCategorizer.
  • Fix issue #4691: Make Vectors.find return keys in correct order.

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @yash1994, @walterhenry, @prilopes, @f11r, @questoph, @erip, @richardpaulhudson and @GuiGel for the pull requests and contributions.

v2.2.2: Multiprocessing, future APIs, Luxembourgish base support & simpler GPU install

31 Oct 16:34

✨ New features and improvements

  • NEW: Support multiprocessing in nlp.pipe via the n_process argument (Python 3 only).
  • Base language support for Luxembourgish.
  • Add noun chunks iterator for Swedish.
  • Retrained models for Greek, Norwegian Bokmål and Lithuanian that now correctly support parser-based sentence segmentation.
  • Repackaged models for Greek and German with improved lookup tables via spacy-lookups-data.
  • Add warning in debug-data for low sentences per doc ratio.
  • Improve checks and errors related to ill-formed IOB input in convert and debug-data CLI.
  • Support training dict format as JSONL.
  • Make EntityRuler ID resolution 2× faster and support "id" in patterns to set Token.ent_id.
  • Improve rendering of named entity spans in displacy for RTL languages.
  • Update Thinc to ditch thinc_gpu_ops for simpler GPU install.
  • Support Mish activation in spacy pretrain.
  • Add forwards-compatible support for new Language.disable_pipes API, which will become the default in the future. The method can now also take a list of component names as its first argument (instead of a variable number of arguments).
    - disabled = nlp.disable_pipes("tagger", "parser")
    + disabled = nlp.disable_pipes(["tagger", "parser"])
  • Add forwards-compatible support for new Matcher.add and PhraseMatcher.add API, which will become the default in the future. The patterns are now the second argument and a list (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
    patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
    - matcher.add("GoogleNow", None, *patterns)
    + matcher.add("GoogleNow", patterns)
    - matcher.add("GoogleNow", on_match, *patterns)
    + matcher.add("GoogleNow", patterns, on_match=on_match)
  • Add new and improved tokenization alignment in gold.align behind a feature flag. The new alignment may produce backwards-incompatible results, so it won't be enabled by default before v3.0.
    import spacy.gold
    spacy.gold.USE_NEW_ALIGN = True
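
The new Matcher.add call style from the list above can be tried end to end with a sketch like the following. It assumes a blank English pipeline; the example sentence and the `seen` list are illustrative additions, not part of the release notes.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # blank pipeline: tokenizer only
matcher = Matcher(nlp.vocab)
seen = []  # collects matched texts (illustrative helper)


def on_match(matcher, doc, i, matches):
    # Callback signature used by the Matcher: called once per match
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
# New style: patterns as a list in second position, callback as a keyword
matcher.add("GoogleNow", patterns, on_match=on_match)

doc = nlp("I use Google Now and GoogleNow")
matches = matcher(doc)
print(sorted(seen))
```

Keeping the patterns in a single list makes it easier to build them programmatically, and the keyword argument removes the need for a placeholder None when no callback is used.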

🔴 Bug fixes

  • Fix issue #1303: Support multiprocessing in nlp.pipe.
  • Fix issue #1745: Ditch thinc_gpu_ops for simpler GPU install.
  • Fix issue #2411: Update Thinc to fix compilation on cygwin.
  • Fix issue #3412: Prevent division by zero in Vectors.most_similar.
  • Fix issue #3618: Fix memory leak for long-running parsing processes.
  • Fix issue #4241: Update Greek lookups in spacy-lookups-data.
  • Fix issue #4269: Extend unicode character block for Sinhala.
  • Fix issue #4362: Improve URL_PATTERN and handling in tokenizer.
  • Fix issue #4373: Make PhraseMatcher.vocab consistent with Matcher.vocab.
  • Fix issue #4377: Clarify serialization of extension attributes.
  • Fix issue #4382: Improve usage of pkg_resources and handling of entry points.
  • Fix issue #4386: Consider batch_size when sorting similar vectors.
  • Fix issue #4389: Fix ner_jsonl2json converter.
  • Fix issue #4397: Ensure on_match callback is executed in PhraseMatcher.
  • Fix issue #4401, #4408: Fix sentence segmentation in Greek, Norwegian and Lithuanian models.
  • Fix issue #4402: Fix issue with how training data was passed through the pipeline.
  • Fix issue #4406: Correct spelling in lemmatizer API docs.
  • Fix issue #4418, #4438: Improve knowledge base and Wikidata parsing.
  • Fix issue #4435: Fix PhraseMatcher.remove for overlapping patterns.
  • Fix issue #4443: Fix bug in Vectors.most_similar.
  • Fix issue #4452: Fix gold.docs_to_json documentation.
  • Fix issue #4463: Add missing cats to GoldParse.from_annot_tuples in Scorer.
  • Fix issue #4470: Suppress convert output if writing to stdout.
  • Fix issue #4475: Correct mistake in docs example.
  • Fix issue #4485: Update tag maps and docs for English and German.
  • Fix issue #4493: Update information in spaCy Universe.
  • Fix issue #4496: Improve docs of PhraseMatcher.add arguments.
  • Fix issue #4506: Ensure Vectors.most_similar returns 1.0 for identical vectors.
  • Fix issue #4509: Fix None iteration error in entity linking script.
  • Fix issue #4524: Fix typo in Parser sample construction of GoldParse.
  • Fix issue #4528: Fix serialization of extension attribute values in DocBin.
  • Fix issue #4529: Ensure GoldParse is initialized correctly with misaligned tokens.
  • Fix issue #4538: Backport memory leak fix to v2.1.x branch and release v2.1.9.

⚠️ Backwards incompatibilities

  • The unused attributes lemma_rules, lemma_index, lemma_exc and lemma_lookup of the Language.Defaults have now been removed to prevent confusion (e.g. if users add rules that then have no effect). The only place lemmatization tables are stored and can be modified at runtime is via nlp.vocab.lookups.
    - nlp.Defaults.lemma_lookup["spaCies"] = "spaCy"
    + lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
    + lemma_lookup["spaCies"] = "spaCy"
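
The same table access pattern can be tried in isolation with a standalone Lookups container. This is a sketch under assumptions: it builds its own "lemma_lookup" table with an invented entry instead of using nlp.vocab.lookups, because a blank pipeline may not ship that table unless spacy-lookups-data is installed.

```python
from spacy.lookups import Lookups

lookups = Lookups()
# Seed a table under the same name the release notes use; the entry is made up.
lookups.add_table("lemma_lookup", {"mice": "mouse"})

table = lookups.get_table("lemma_lookup")
table["spaCies"] = "spaCy"  # tables can be modified at runtime like a dict

print(table["spaCies"], table["mice"], lookups.has_table("lemma_lookup"))
```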

📖 Documentation and examples

  • Fix various typos and inconsistencies.
  • Add more projects to the spaCy Universe.

👥 Contributors

Thanks to @tamuhey, @PeterGilles, @akornilo, @danielkingai2, @ghollah, @pberba, @gustavengstrom, @ju-sh, @kabirkhan, @ZhuoruLin, @nipunsadvilkar and @neelkamath for the pull requests and contributions.

v2.1.9: Backport memory leak fix

28 Oct 16:22

This is a small maintenance update that backports a fix for a memory leak that could occur in long-running parsing processes. It's intended for users who can't or don't yet want to upgrade to spaCy v2.2 (e.g. because it requires retraining all the models). If you're able to upgrade, install the latest v2.2 instead of this version.

🔴 Bug fixes

  • Fix issue #3618: Fix memory leak for long-running parsing processes.
  • Fix issue #4538: Backport memory leak fix to v2.1.x branch.

v2.2.1: Fix DocBin and Dutch model, improve Vectors.most_similar

03 Oct 14:22

✨ New features and improvements

  • Make Vectors.most_similar return the top most similar vectors instead of only one.
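
A rough sketch of asking for more than one neighbour, assuming a tiny hand-made vectors table (the keys and values are invented) and the keyword argument `n` for the number of results per query:

```python
import numpy
from spacy.vectors import Vectors

# Three 2-dimensional toy vectors; keys and values are illustrative only.
data = numpy.asarray([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype="f")
vectors = Vectors(data=data, keys=["east", "north", "northeast"])

# One query vector, pointing mostly "east"
query = numpy.asarray([[1.0, 0.1]], dtype="f")
keys, best_rows, scores = vectors.most_similar(query, n=2)

print(best_rows.shape, scores.shape)
```

The three return values are parallel arrays with one row per query: the hashed keys, the row indices into the vectors table, and the similarity scores.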

🔴 Bug fixes

  • Fix issue #4365: Fix tag map in Dutch model.
  • Fix issue #4368: Fix initialization of DocBin with attributes.

📖 Documentation and examples

👥 Contributors

Thanks to @bintay and @svlandeg for the pull requests and contributions.

v2.2.0: Norwegian & Lithuanian models, better Dutch NER, smaller install, faster matching & more

02 Oct 14:47

⚠️ This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

✨ New features and improvements

  • NEW: Pretrained core models for Norwegian (MIT) and Lithuanian (CC BY-SA).
  • NEW: Better pre-trained Dutch NER using custom labelled UD corpus instead of WikiNER.
  • NEW: Make spaCy roughly 5-10× smaller on disk (depending on your platform) by compressing and moving lookups to a separate package.
  • NEW: EntityLinker and KnowledgeBase API to train and access entity linking models, plus scripts to train your own Wikidata models.
  • NEW: 10× faster PhraseMatcher and improved phrase matching algorithm.
  • NEW: DocBin class to efficiently serialize collections of Doc objects.
  • NEW: Train text classification models on the command line with spacy train and get textcat results via the Scorer.
  • NEW: debug-data command to validate your training and development data, get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more.
  • NEW: Efficient Lookups class using Bloom filters that allows storing, accessing and serializing large dictionaries via vocab.lookups.
  • Data augmentation in spacy train via the --orth-variant-level flag, which defines the percentage of occurrences of some tokens subject to replacement during training.
  • Add nlp.pipe_labels (labels assigned by pipeline components) and include "labels" in nlp.meta.
  • Support spacy_displacy_colors entry point to allow packages to add entity colors to displacy.
  • Allow template config option in displacy to customize entity HTML template.
  • Improve match pattern validation and handling of unsupported attributes.
  • Add lookup lemmatization data for Croatian and Serbian.
  • Update and improve language data for Chinese, Croatian, Thai, Romanian, Hindi and English.
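
For the DocBin class mentioned above, a minimal round trip might look like the sketch below. It assumes a blank English pipeline and default DocBin settings; the two example texts are invented.

```python
from spacy.lang.en import English
from spacy.tokens import DocBin

nlp = English()  # blank pipeline: tokenizer only
doc_bin = DocBin()  # default attributes
for text in ["First document.", "Second document, slightly longer."]:
    doc_bin.add(nlp(text))

data = doc_bin.to_bytes()  # one byte string for the whole collection

# Restore the collection; Docs are rebuilt against a shared vocab
restored = DocBin().from_bytes(data)
docs = list(restored.get_docs(nlp.vocab))
print([doc.text for doc in docs])
```

Serializing many Doc objects together avoids repeating per-document overhead, which is where the efficiency gain over pickling each Doc separately comes from.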

🔴 Bug fixes

  • Fix issue #3258: Reduce package size on disk by moving and compressing large dictionaries.
  • Fix issue #3540: Update lemma and vector information after splitting a token.
  • Fix issue #3687: Automatically skip duplicates in Doc.retokenize.
  • Fix issue #3830: Retrain German model and fix subtok errors.
  • Fix issue #3850: Allow customizing entity HTML template in displaCy.
  • Fix issue #3879, #3951, #4154: Fix bug in Matcher retry loop that'd cause problems with ? operator.
  • Fix issue #3917: Raise error for negative token indices in displacy.
  • Fix issue #3922: Add PhraseMatcher.remove method.
  • Fix issue #3959, #4133: Make sure both pos and tag are correctly serialized.
  • Fix issue #3972: Ensure PhraseMatcher returns multiple matches for identical rules.
  • Fix issue #4020: Raise error for overlapping entities in biluo_tags_from_offsets.
  • Fix issue #4051: Ensure retokenizer sets POS tags correctly on merge.
  • Fix issue #4070: Improve token pattern checking without validation.
  • Fix issue #4096: Add checks for cycles in debug-data.
  • Fix issue #4100: Improve docs on phrase pattern attributes.
  • Fix issue #4102: Correct mistakes in English lookup lemmatizer data.
  • Fix issue #4104: Make visualized NER examples in docs more clear.
  • Fix issue #4107: Automatically set span root attributes on merging.
  • Fix issue #4111, #4170: Improve NER/IOB converters.
  • Fix issue #4120: Correctly handle ? operator at the end of pattern.
  • Fix issue #4123: Provide more details in cycle error message E069.
  • Fix issue #4138: Correctly open .html files as UTF-8 in evaluate command.
  • Fix issue #4139: Make emoticon data a raw string.
  • Fix issue #4148: Add missing API docs for force flag on set_extension.
  • Fix issue #4155: Correct language code for Serbian.
  • Fix issue #4165: Add more attributes to matcher validation schema.
  • Fix issue #4190: Fix caching issue that'd cause tokenizer to not be deserialized correctly.
  • Fix issue #4200: Work around tqdm bug that'd remove text color from terminal output.
  • Fix issue #4229: Fix handling of pre-set entities.
  • Fix issue #4238: Flush tokenizer cache when affixes, token_match, or special cases are modified.
  • Fix issue #4242: Make .pos/.tag distinction more clear in the docs.
  • Fix issue #4245: Fix bug that occurred when processing empty string in Korean.
  • Fix issue #4262: Fix handling of spaces in Japanese.
  • Fix issue #4269: Tokenize punctuation correctly in Kannada, Tamil, and Telugu and add unicode characters to default sentencizer config.
  • Fix issue #4270: Fix --vectors-loc documentation.
  • Fix issue #4302: Remove duplicate Parser.tok2vec property.
  • Fix issue #4303: Correctly support as_tuples and return_matches in Matcher.pipe.
  • Fix issue #4307: Ensure that pre-set entities are preserved and allow overwriting unset tokens.
  • Fix issue #4308: Fix bug that could cause PhraseMatcher with very large lists to miss matches.
  • Fix issue #4348: Ensure training doesn't crash with empty batches.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • The lemmatization tables have been moved to their own package, spacy-lookups-data, which is not installed by default. If you're using pre-trained models, nothing changes, because the tables are now included in the model packages. If you want to use the lemmatizer for other languages that don't yet have pre-trained models (e.g. Turkish or Croatian) or start off with a blank model that contains lookup data (e.g. spacy.blank("en")), you'll need to explicitly install spaCy plus data via pip install spacy[lookups]. The data will be registered automatically via entry points.
  • Lemmatization tables (rules, exceptions, index and lookups) are now part of the Vocab and serialized with it. This means that serialized objects (nlp, pipeline components, vocab) will now include additional data, and models written to disk will include additional files.
  • The Lemmatizer class is now initialized with an instance of Lookups containing the rules and tables, instead of dicts as separate arguments. This makes it easier to share data tables and modify them at runtime. This is mostly internals, but if you've been implementing a custom Lemmatizer, you'll need to update your code.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • The Dutch model has been trained on a new NER corpus (custom labelled UD instead of WikiNER), so its predictions may be very different compared to the previous version. The results should be significantly better and more generalizable, though.
  • The spacy download command does not set the --no-deps pip argument anymore by default, meaning that model package dependencies (if available) will now be also downloaded and installed. If spaCy (which is also a model dependency) is not installed in the current environment, e.g. if a user has built from source, --no-deps is added back automatically to prevent spaCy from being downloaded and installed again from pip.
  • The built-in biluo_tags_from_offsets converter is now stricter and will raise an error if entities are overlapping (instead of silently skipping them). If your data contains invalid entity annotations, make sure to clean it and resolve conflicts. You can now also use the new debug-data command to find problems in your data.
  • Pipeline components can now overwrite IOB tags of tokens that are not yet part of an entity. Once a token has an ent_iob value set, it won't be reset to an "unset" state and will always have at least O assigned. list(doc.ents) now actually keeps the annotations on the token level consistent, instead of resetting O to an empty string.
  • The default punctuation in the Sentencizer has been extended and now includes more characters common in various languages. This also means that the results it produces may change, depending on your text. If you want the previous behaviour with limited characters, set punct_chars=[".", "!", "?"] on initialization.
  • The PhraseMatcher algorithm was rewritten from scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs with very large terminology lists. So if you were matching large lists, you may see slightly different results – however, the results should now be fully correct. See #4309 for details on this change.
  • The Serbian language class (introduced in v2.1.8) incorrectly used the language code rs instead of sr. This has now been fixed, so Serbian is now available via spacy.lang.sr.
  • The "sources" in the meta.json have changed from a list of strings to a list of dicts. This is mostly internals, but if your code used nlp.meta["sources"], you might have to update it.
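
To restore the previous Sentencizer behaviour described above, the punct_chars setting can be passed on initialization. The sketch below assumes a blank English pipeline and applies the component directly to a Doc instead of adding it to the pipeline; the example text is invented.

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()  # blank pipeline: tokenizer only
# Limit sentence-final punctuation to the three pre-v2.2 characters
sentencizer = Sentencizer(punct_chars=[".", "!", "?"])

doc = sentencizer(nlp("This is one. This is two! Is this three?"))
print([sent.text for sent in doc.sents])
```

Calling the component directly keeps the example independent of how components are registered on the pipeline.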

📈 Benchmarks

Model           Language  Version  UAS    LAS    POS    NER F  Vec  Size
en_core_web_sm  English   2.2.0    91.61  89.71  97.03  85.07  𐄂    11 MB
en_core_web_md  English   2.2.0    91.65  89.77  97.14  86.10  ✓    91 MB
en_core_web_lg  English   ...

v2.1.8: Usability improvements and Serbian alpha tokenization

08 Aug 09:20

✨ New features and improvements

  • NEW: Alpha tokenization support for Serbian.
  • Improve language data for Urdu.
  • Support installing and loading model packages in the same session.

🔴 Bug fixes

  • Fix issue #4002: Make PhraseMatcher work as expected for NORM attribute.
  • Fix issue #4063: Improve docs on Matcher attributes.
  • Fix issue #4068: Make Korean work as expected on Python 2.7.
  • Fix issue #4069: Add validate option to EntityRuler.
  • Fix issue #4074: Raise error if annotation dict in simple training style has unexpected keys.
  • Fix issue #4081: Fix typo in pyproject.toml.
  • Fix handling of keyword arguments in Language.evaluate.

📖 Documentation and examples

👥 Contributors

Thanks to @akornilo, @mirfan899, @veer-bains, @seppeljordan, @Pavle992, @svlandeg, @jenojp and @adrianeboyd for the pull requests and contributions.

v2.1.7: Improved evaluation, better language factories and bug fixes

01 Aug 17:43

✨ New features and improvements

  • Add Token.tensor and Span.tensor attributes.
  • Support simple training format of (text, annotations) instead of only (doc, gold) for nlp.evaluate.
  • Add support for "lang_factory" setting in model meta.json (see #4031).
  • Also support "requirements" in meta.json to define packages for setup's install_requires.
  • Improve Pipe base class methods and make them less presumptuous.
  • Improve Danish and Korean tokenization.
  • Improve error messages when deserializing model fails.

🔴 Bug fixes

  • Fix issue #3669, #3962: Fix dependency copy in Span.as_doc that could cause segfault.
  • Fix issue #3968: Fix bug in per-entity scores.
  • Fix issue #4000: Improve entity linking API.
  • Fix issue #4022: Fix error when Korean text contains special characters.
  • Fix issue #4030: Handle edge case when calling TextCategorizer.predict with empty Doc.
  • Fix issue #4045: Correct Span.sent docs.
  • Fix issue #4048: Fix init-model command if there's no vocab.
  • Fix issue #4052: Improve per-type scoring of NER.
  • Fix issue #4054: Ensure the lang of nlp and nlp.vocab stay consistent.
  • Fix bugs in Token.similarity and Span.similarity when called via hook.

📖 Documentation and examples

👥 Contributors

Thanks to @sorenlind, @pmbaumgartner, @svlandeg, @FallakAsad, @BreakBB, @adrianeboyd, @polm, @b1uec0in, @mdaudali and @ejarkm for the pull requests and contributions.

v2.1.6: Fix order of symbols that caused tag maps to be out-of-sync

12 Jul 16:18

🔴 Bug fixes

  • Fix issue #3958: Fix order of symbols that caused tag maps to be out-of-sync.