# 💫 Improve annotation serialisation #1045
**Labels:** enhancement · 🌙 nightly · ⚠️ wip (added by ines on May 7, 2017)
*Update: see the v2.0.0 alpha release notes and #1105 🎉*
A persistent source of problems in spaCy 1.0 has been the way data is saved and loaded. This issue describes what's changing, and will be updated as implementation proceeds. The changes will finally allow annotations to support the Pickle protocol, making it much easier to use spaCy with Spark and other tools.
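For illustration, this is roughly the end state the change is aiming at: a `Doc` that round-trips through the standard `pickle` module. The model name below is just an example, and exactly what gets bundled alongside the `Doc` (e.g. the `Vocab`) is part of the design described in the rest of this issue.

```python
import pickle
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline; the name is illustrative
doc = nlp("Pickling a Doc should just work once this lands.")

# The point of the change: the standard-library protocol round-trips,
# which is what tools like Spark rely on to ship objects between workers.
restored = pickle.loads(pickle.dumps(doc))
assert [t.text for t in restored] == [t.text for t in doc]
assert [t.tag_ for t in restored] == [t.tag_ for t in doc]
```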
## How saving a `Doc` works in 1.x

Annotations are exported as integer IDs, into a `numpy` array. A custom Huffman-tree implementation is used to store the following fields:

- `ORTH` – backing off to characters for unseen words
- `SPACY` (boolean flag for whether the word has a space after it)
- `HEAD` (as offset from the token)
- `TAG` (part-of-speech tag; must have an entry in the tag map)
- `ENT_IOB` (IOB format for entity tags – one of 0, 1, 2, 3)
- `ENT_TYPE` (must have an entry in strings.json)
- `DEP` (dependency label – must have an entry in strings.json)

To load the serialized bytes into a new `Doc` object, you need a matching `Vocab`: to decode the bytes into the correct integer IDs, and then to map the integer IDs to the correct lexemes and tags.
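For reference, the 1.x round trip in user code looks roughly like this (`to_bytes`/`from_bytes` are the methods referenced in the issues listed at the bottom; the exact call pattern may differ slightly between 1.x releases):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en")  # provides the Vocab the bytes will be decoded against

doc = nlp("The 1.x format packs annotations into a Huffman-coded byte string.")
data = doc.to_bytes()

# Loading only works with a matching Vocab: the integer IDs in the byte
# string are meaningless without the same string store and tag map.
new_doc = Doc(nlp.vocab)
new_doc.from_bytes(data)
```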
## How saving a `Doc` will work in 2.x

Individual documents will now be serialized as a tuple `(attrs, text)`. Instead of our own Huffman codec, we'll just store the `numpy` arrays directly – these compress fine in bulk anyway, especially when merged.
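A minimal sketch of what the `(attrs, text)` idea amounts to, written against the public `Doc.to_array`/`Doc.from_array` methods. The attribute selection and the use of `pickle` here are illustrative assumptions, not the planned wire format:

```python
import pickle
import spacy
from spacy.attrs import TAG, DEP, HEAD, ENT_IOB, ENT_TYPE

ATTRS = [TAG, DEP, HEAD, ENT_IOB, ENT_TYPE]  # assumed column set

nlp = spacy.load("en_core_web_sm")
doc = nlp("The new format stores plain numpy arrays next to the original text.")

# Serialize: the raw text plus one numpy array of integer attribute IDs.
payload = pickle.dumps((doc.text, doc.to_array(ATTRS)))

# Deserialize: re-tokenize the text with the same Vocab, then restore
# the annotation columns from the array.
text, attr_array = pickle.loads(payload)
new_doc = nlp.make_doc(text)            # tokenization only, no annotations
new_doc.from_array(ATTRS, attr_array)   # copy the annotations back in
```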
To save and load the document, you'll still need to have a reference to the `Vocab` object. To assist this, a new `Binder` class will be introduced, which will allow a group of documents to be saved and loaded together.

The `Binder` will support Pickle and JSON serialisation. More serialisation protocols can be added in future.
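Since the `Binder` isn't implemented yet, the following is only a hypothetical sketch of the idea – collect `(attrs, text)` tuples for a batch of docs so they can be pickled and restored together against a shared `Vocab`. None of the names or details here are final:

```python
import pickle
from spacy.attrs import TAG, DEP, HEAD, ENT_IOB, ENT_TYPE

ATTRS = [TAG, DEP, HEAD, ENT_IOB, ENT_TYPE]  # assumed column set, as above


class BinderSketch:
    """Hypothetical stand-in for the proposed Binder: stores one
    (text, attr_array) tuple per document."""

    def __init__(self):
        self.tuples = []

    def add(self, doc):
        self.tuples.append((doc.text, doc.to_array(ATTRS)))

    def to_bytes(self):
        # Pickle is one of the protocols mentioned above; JSON would need
        # the arrays converted to plain lists first.
        return pickle.dumps(self.tuples)

    @classmethod
    def from_bytes(cls, data):
        binder = cls()
        binder.tuples = pickle.loads(data)
        return binder

    def get_docs(self, nlp):
        # nlp must share the Vocab the documents were encoded against.
        for text, array in self.tuples:
            doc = nlp.make_doc(text)
            doc.from_array(ATTRS, array)
            yield doc
```

Batching is also where the "compress fine in bulk" point pays off: the pickled list of arrays can be gzip-compressed as a single stream rather than per document.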
The `Binder` will support two header styles: a standalone format, and a diff against a model ID. If you serialise the `Binder` standalone, you'll read out the whole vocab, so this will be large for a small set of documents. However, it will ensure you'll be able to load the documents correctly in future. The diff format will require you to have the appropriate model ID loaded. The `Binder` will then include in its header any extra information not in the base model, for instance missing strings.
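To make the standalone-versus-diff distinction concrete, one hypothetical way to build the diff header is to record only the strings the documents use that the base model's `StringStore` doesn't already contain. This is purely illustrative; the real header format is undecided:

```python
def missing_strings(docs, base_vocab):
    """Hypothetical diff-header helper: collect strings used by the docs
    that are not already present in the base model's StringStore."""
    extra = set()
    for doc in docs:
        for token in doc:
            for value in (token.text, token.tag_, token.dep_, token.ent_type_):
                if value and value not in base_vocab.strings:
                    extra.add(value)
    return extra
```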
## Summary of code changes

- New `spacy.tokens.binder` module, with a `spacy.tokens.binder.Binder` class
- Pickle support for `Doc`, `Token` and `Span` objects
- Removal of the `spacy.serialize` subpackage

## Related issues

- `.pipe()` sometimes serialize into invalid byte strings #264
- inconsistent sentence boundaries before and after serialization #322
- compatibility of serialized Docs across spacy/vocab versions #343
- Key error when serializing #350
- spark does not utilize all cores when running a spacy parse or dies in serialization code #413
- `doc.to_bytes()` fails with MemoryError: bad allocation #437
- KeyError when serializing a doc object after adding a new entity label #514
- Issues while loading Textacy corpus from disk #530
- `to_bytes` and `from_bytes` changes the token lemma #636
- Loading NER model from dump doesn't work #664
- Deserialization fails based on whether "nlp" object was used yet #728
- Serializing user data in Doc objects #785
- Deserialization fails with a new model instance #927
- UnicodeDecodeError in `doc.from_bytes` #985
- High memory usage when deserializing Docs #992
- `Doc.to_bytes()` throws KeyError on label-names for rule-matched spans #1011