v2.1.0a3: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0, available on pip as `spacy-nightly`. It's not intended for production use.

```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect to fix this in a future release.
## ✨ New features and improvements
### Tagger, Parser & NER

- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow the parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make the parser, tagger and NER faster, through better hyperparameters.
- Add `EntityRecognizer.labels` property (see the sketch after this list).
- Remove the document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-threaded execution with the fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
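
For instance, the new `labels` property can be inspected on a loaded pipeline. A minimal sketch, assuming an installed `en_core_web_sm` model:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
# New in this release: the entity recognizer exposes its label set
print(ner.labels)  # e.g. ('GPE', 'ORG', 'PERSON', ...)
```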
### Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
### CLI

- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if a model is already installed before downloading it via `spacy download` (see the sketch after this list).
- Pass additional arguments of the `download` command to `pip` to customise installation.
- Improve the `train` command by letting `GoldCorpus` stream data, instead of loading it all into memory.
- Improve the `init-model` command, including support for lexical attributes and word vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to the `train` command.
- Add support for data augmentation to the `train` command.
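
The CLI commands are also exposed as plain functions in `spacy.cli`, so the new download behaviour can be scripted too. A minimal sketch, assuming the `en_core_web_sm` package name:

```python
from spacy.cli import download

# Skips the download if the model is already installed (new in this release)
download("en_core_web_sm")
```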
### Other

- NEW: `Doc.retokenize` context manager for merging tokens more efficiently (example below).
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513, and the sketch below).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG` (example below).
- Add warnings if the `.similarity` method is called with empty vectors or without word vectors.
- Improve the rule-based `Matcher` and add a `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects (example below).
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
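
Merging tokens with the new context manager looks like this; a minimal sketch on a blank English pipeline:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live in New York")
# Merge the span "New York" into a single token, in place
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5])
print([token.text for token in doc])  # ['I', 'live', 'in', 'New York']
```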
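For the rule-based NER component, here's a minimal sketch; the class name `EntityRuler` is an assumption based on the v2.1 code base, so check #2513 for the final API:

```python
import spacy
from spacy.pipeline import EntityRuler  # assumed component name, see #2513

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
# Patterns can be exact strings or token-pattern dicts, as with the Matcher
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])
nlp.add_pipe(ruler)

doc = nlp("I am using spaCy")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('spaCy', 'ORG')]
```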
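Case-insensitive phrase matching via the new attribute support, as a minimal sketch (matching on `POS` or `TAG` would additionally require a tagged `Doc`):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# attr="LOWER" compares the LOWER attribute, so matching is case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("OBAMA", None, nlp("barack obama"))

doc = nlp("Barack Obama visited the city")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # Barack Obama
```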
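And the new `return_matches` flag on `Matcher.pipe`, sketched on a blank pipeline:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HELLO", None, [{"LOWER": "hello"}])

docs = nlp.pipe(["Hello world", "nothing to see"])
# With return_matches=True, pipe yields (doc, matches) tuples
# instead of only the Doc objects
for doc, matches in matcher.pipe(docs, return_matches=True):
    print(doc.text, matches)
```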
## 🚧 Under construction

This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.

- Enhanced pattern API for the rule-based `Matcher` (see #1971).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
- Improved JSON(L) format for training (see #2928, #2932).
- `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- Refactor CLI and add `debug-data` command to validate training data (see #2932).
## 🔴 Bug fixes

- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #2014: Make `Token.pos_` writeable (example below).
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issues #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
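
To illustrate the #2014 fix, the coarse-grained tag can now be assigned directly; a minimal sketch:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("hello world")
doc[0].pos_ = "INTJ"  # previously raised an error; now writeable
print(doc[0].pos_)  # INTJ
```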
## ⚠️ Backwards incompatibilities

- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
## 📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `en_core_web_sm` | English | 2.1.0a4 | 91.7 | 89.8 | 96.8 | 85.7 | ✗ | 12 MB |
| `en_core_web_md` | English | 2.1.0a4 | 92.0 | 90.1 | 97.0 | 86.2 | ✓ | 93 MB |
| `en_core_web_lg` | English | 2.1.0a4 | 92.1 | 90.3 | 97.0 | 86.5 | ✓ | 780 MB |
| `de_core_news_sm` | German | 2.1.0a4 | 91.9 | 89.8 | 97.2 | 83.4 | ✗ | 12 MB |
| `de_core_news_md` | German | 2.1.0a4 | 91.3 | 90.5 | 97.4 | 83.6 | ✓ | 212 MB |
| `es_core_news_sm` | Spanish | 2.1.0a4 | 90.1 | 87.1 | 96.8 | 89.3 | ✗ | 12 MB |
| `es_core_news_md` | Spanish | 2.1.0a4 | 90.7 | 87.8 | 97.1 | 89.4 | ✓ | 72 MB |
| `pt_core_news_sm` | Portuguese | 2.1.0a4 | 89.2 | 85.8 | 79.8 | 82.4 | ✗ | 14 MB |
| `fr_core_news_sm` | French | 2.1.0a4 | 87.2 | 84.0 | 94.4 | 67.0 ¹ | ✗ | 16 MB |
| `fr_core_news_md` | French | 2.1.0a4 | 88.8 | 86.0 | 94.9 | 70.0 ¹ | ✓ | 84 MB |
| `it_core_news_sm` | Italian | 2.1.0a4 | 90.6 | 87.0 | 96.0 | 81.7 | ✗ | 12 MB |
| `nl_core_news_sm` | Dutch | 2.1.0a4 | 83.1 | 77.2 | 91.3 | 87.3 | ✗ | 12 MB |
| `el_core_news_sm` | Greek | 2.1.0a4 | 84.2 | 80.4 | 94.6 | 71.5 | ✗ | 12 MB |
| `el_core_news_md` | Greek | 2.1.0a4 | 87.5 | 84.1 | 96.4 | 78.3 | ✓ | 128 MB |
| `xx_ent_wiki_sm` | Multi | 2.1.0a4 | - | - | - | 83.2 | ✗ | 4 MB |

¹ We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
## 📖 Documentation and examples
- Fix various typos and inconsistencies.
## 👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas and @skrcode for the pull requests and contributions.