v2.1.0a3: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0, available on pip as `spacy-nightly`. It's not intended for production use.

```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect to fix this in a future release.
## ✨ New features and improvements
### Tagger, Parser & NER

- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow the parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make the parser, tagger and NER faster, through better hyperparameters.
- Add `EntityRecognizer.labels` property (see the sketch after this list).
- Remove the document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-threaded execution with the fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
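
For instance, the new `labels` property can be inspected on a loaded pipeline. A minimal sketch, assuming an installed `en_core_web_sm` model:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
# New in this release: the entity recognizer exposes its label set
print(ner.labels)  # e.g. ('GPE', 'ORG', 'PERSON', ...)
```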
### Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
### CLI

- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if a model is already installed before downloading it via `spacy download` (see the sketch after this list).
- Pass additional arguments of the `download` command to `pip` to customise installation.
- Improve the `train` command by letting `GoldCorpus` stream data, instead of loading it all into memory.
- Improve the `init-model` command, including support for lexical attributes and word vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to the `train` command.
- Add support for data augmentation to the `train` command.
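
The CLI commands are also exposed as plain functions in `spacy.cli`, so the new download behaviour can be scripted too. A minimal sketch, assuming the `en_core_web_sm` package name:

```python
from spacy.cli import download

# Skips the download if the model is already installed (new in this release)
download("en_core_web_sm")
```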
### Other

- NEW: `Doc.retokenize` context manager for merging tokens more efficiently (example below).
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513, and the sketch below).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG` (example below).
- Add warnings if the `.similarity` method is called with empty vectors or without word vectors.
- Improve the rule-based `Matcher` and add a `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects (example below).
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
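
Merging tokens with the new context manager looks like this; a minimal sketch on a blank English pipeline:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live in New York")
# Merge the span "New York" into a single token, in place
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5])
print([token.text for token in doc])  # ['I', 'live', 'in', 'New York']
```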
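For the rule-based NER component, here's a minimal sketch; the class name `EntityRuler` is an assumption based on the v2.1 code base, so check #2513 for the final API:

```python
import spacy
from spacy.pipeline import EntityRuler  # assumed component name, see #2513

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
# Patterns can be exact strings or token-pattern dicts, as with the Matcher
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])
nlp.add_pipe(ruler)

doc = nlp("I am using spaCy")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('spaCy', 'ORG')]
```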
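Case-insensitive phrase matching via the new attribute support, as a minimal sketch (matching on `POS` or `TAG` would additionally require a tagged `Doc`):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# attr="LOWER" compares the LOWER attribute, so matching is case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("OBAMA", None, nlp("barack obama"))

doc = nlp("Barack Obama visited the city")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # Barack Obama
```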
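And the new `return_matches` flag on `Matcher.pipe`, sketched on a blank pipeline:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HELLO", None, [{"LOWER": "hello"}])

docs = nlp.pipe(["Hello world", "nothing to see"])
# With return_matches=True, pipe yields (doc, matches) tuples
# instead of only the Doc objects
for doc, matches in matcher.pipe(docs, return_matches=True):
    print(doc.text, matches)
```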
## 🚧 Under construction

This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.

- Enhanced pattern API for the rule-based `Matcher` (see #1971).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
- Improved JSON(L) format for training (see #2928, #2932).
- `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- Refactor CLI and add `debug-data` command to validate training data (see #2932).
## 🔴 Bug fixes

- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #2014: Make `Token.pos_` writeable (example below).
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issues #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
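
To illustrate the #2014 fix, the coarse-grained tag can now be assigned directly; a minimal sketch:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("hello world")
doc[0].pos_ = "INTJ"  # previously raised an error; now writeable
print(doc[0].pos_)  # INTJ
```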
## ⚠️ Backwards incompatibilities

- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
## 📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `en_core_web_sm` | English | 2.1.0a4 | 91.7 | 89.8 | 96.8 | 85.7 | ✗ | 12 MB |
| `en_core_web_md` | English | 2.1.0a4 | 92.0 | 90.1 | 97.0 | 86.2 | ✓ | 93 MB |
| `en_core_web_lg` | English | 2.1.0a4 | 92.1 | 90.3 | 97.0 | 86.5 | ✓ | 780 MB |
| `de_core_news_sm` | German | 2.1.0a4 | 91.9 | 89.8 | 97.2 | 83.4 | ✗ | 12 MB |
| `de_core_news_md` | German | 2.1.0a4 | 91.3 | 90.5 | 97.4 | 83.6 | ✓ | 212 MB |
| `es_core_news_sm` | Spanish | 2.1.0a4 | 90.1 | 87.1 | 96.8 | 89.3 | ✗ | 12 MB |
| `es_core_news_md` | Spanish | 2.1.0a4 | 90.7 | 87.8 | 97.1 | 89.4 | ✓ | 72 MB |
| `pt_core_news_sm` | Portuguese | 2.1.0a4 | 89.2 | 85.8 | 79.8 | 82.4 | ✗ | 14 MB |
| `fr_core_news_sm` | French | 2.1.0a4 | 87.2 | 84.0 | 94.4 | 67.0 ¹ | ✗ | 16 MB |
| `fr_core_news_md` | French | 2.1.0a4 | 88.8 | 86.0 | 94.9 | 70.0 ¹ | ✓ | 84 MB |
| `it_core_news_sm` | Italian | 2.1.0a4 | 90.6 | 87.0 | 96.0 | 81.7 | ✗ | 12 MB |
| `nl_core_news_sm` | Dutch | 2.1.0a4 | 83.1 | 77.2 | 91.3 | 87.3 | ✗ | 12 MB |
| `el_core_news_sm` | Greek | 2.1.0a4 | 84.2 | 80.4 | 94.6 | 71.5 | ✗ | 12 MB |
| `el_core_news_md` | Greek | 2.1.0a4 | 87.5 | 84.1 | 96.4 | 78.3 | ✓ | 128 MB |
| `xx_ent_wiki_sm` | Multi | 2.1.0a4 | - | - | - | 83.2 | ✗ | 4 MB |

¹ We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
## 📖 Documentation and examples
- Fix various typos and inconsistencies.
## 👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas and @skrcode for the pull requests and contributions.