Extension is not recognized when reloading spacy doc #4377

Closed
ghost opened this issue Oct 4, 2019 · 3 comments
Labels
docs (Documentation and website), feat / doc (Feature: Doc, Span and Token objects), usage (General spaCy usage)

Comments

ghost commented Oct 4, 2019

Hello,
Custom extensions are not recognized when loading a spaCy Doc object from disk. They are recognized while the same kernel is active, but after restarting the kernel the custom extensions are no longer recognized, whereas other attributes are fine.
For example:

import spacy
from spacy.tokens import Doc, Span, Token

Doc.set_extension("doc_cat", force=True, default=None)
nlp = spacy.load("en_core_web_sm")
doc = nlp("I am trying to set an extension for doc object")
doc._.set("doc_cat", "AN")
doc.to_disk("corpus/mydoc")

loaded_doc = Doc(nlp.vocab).from_disk("corpus/mydoc")
print(loaded_doc._.doc_cat)  # output: "AN"

After shutting down (restarting) the kernel, I got this error:

import spacy
from spacy.tokens import Doc, Span, Token
nlp = spacy.load("en_core_web_sm")
loaded_doc = Doc(nlp.vocab).from_disk("corpus/mydoc")
print(loaded_doc._.doc_cat)
AttributeError                            Traceback (most recent call last)
<ipython-input-1-0b8d370be645> in <module>
      3 nlp = spacy.load("en_core_web_sm")
      4 loaded_doc = Doc(nlp.vocab).from_disk("corpus/mydoc")
----> 5 loaded_doc._.doc_cat

~/anaconda3/envs/back/lib/python3.6/site-packages/spacy/tokens/underscore.py in __getattr__(self, name)
     33     def __getattr__(self, name):
     34         if name not in self._extensions:
---> 35             raise AttributeError(Errors.E046.format(name=name))
     36         default, method, getter, setter = self._extensions[name]
     37         if getter is not None:

AttributeError: [E046] Can't retrieve unregistered extension attribute 'doc_cat'. Did you forget to call the `set_extension` method?

Any idea why that is?

====================== Installed models (spaCy v2.2.1) ======================
ℹ spaCy installation:
anaconda3/envs/back/lib/python3.6/site-packages/spacy

TYPE      NAME             MODEL            VERSION
package   en-core-web-sm   en_core_web_sm   2.2.0   ✔
package   en-core-web-lg   en_core_web_lg   2.2.0   ✔

ines added the feat / doc (Feature: Doc, Span and Token objects) and usage (General spaCy usage) labels Oct 4, 2019
ines (Member) commented Oct 4, 2019

This is kind of expected: an extension attribute is a custom attribute registered on the global Doc class. When you serialize a Doc, you can serialize the values of its user_data (which includes custom attribute values). But if you want to retrieve that value, you first need to register the extension attribute again.

Registering attributes automatically on deserialization would be problematic, because it'd make it very difficult to deal with conflicting attributes from different docs (e.g. if one doc has attr A and another has attr B, would all docs now have A and B?).
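
To make the distinction above concrete, here is a minimal pure-Python sketch of the idea. It is a simplified model, not spaCy's actual implementation: `MiniDoc` and its methods are hypothetical stand-ins. The registry lives on the class (one per process), while only the per-document `user_data` values are written out, which is why the value survives a restart but cannot be read until the attribute is registered again.

```python
import json


class MiniDoc:
    """Hypothetical stand-in for a Doc: class-level registry, per-instance values."""

    # Shared by every MiniDoc in this process; never serialized.
    _extensions = {}

    def __init__(self):
        # Per-document values; this is the only part that gets serialized.
        self.user_data = {}

    @classmethod
    def set_extension(cls, name, default=None):
        cls._extensions[name] = default

    def get(self, name):
        if name not in self._extensions:
            raise AttributeError(f"unregistered extension attribute {name!r}")
        return self.user_data.get(name, self._extensions[name])

    def set(self, name, value):
        if name not in self._extensions:
            raise AttributeError(f"unregistered extension attribute {name!r}")
        self.user_data[name] = value

    def to_bytes(self):
        return json.dumps(self.user_data).encode("utf8")

    @classmethod
    def from_bytes(cls, data):
        doc = cls()
        doc.user_data = json.loads(data.decode("utf8"))
        return doc


MiniDoc.set_extension("doc_cat")
doc = MiniDoc()
doc.set("doc_cat", "AN")
payload = doc.to_bytes()

# Simulate a restarted kernel: the class-level registry starts out empty.
MiniDoc._extensions.clear()
loaded = MiniDoc.from_bytes(payload)

try:
    loaded.get("doc_cat")
    failed_before_registering = False
except AttributeError:
    failed_before_registering = True

# The value was stored all along; registering the name makes it readable again.
MiniDoc.set_extension("doc_cat")
restored = loaded.get("doc_cat")
print(failed_before_registering, restored)
```

In this model, auto-registering on load would mean every loaded payload silently adds names to the shared class registry, which is the conflict described above.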

ghost (Author) commented Oct 4, 2019

I see. Thanks. It would be great to have this line in the documentation (serialization section):
"For custom extensions, only the values are serialized; you will need to register the extension attribute again to restore the value."
But it is still annoying if you have many.
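
One way to reduce the repetition with many attributes is to keep them in a single table and register them in one call at the start of each session. The sketch below is only an illustration: `register_extensions` is a hypothetical helper, not part of spaCy, and `StandInDoc` exists solely so the example runs without spaCy installed. The helper relies only on spaCy-style `has_extension` and `set_extension(name, default=...)` classmethods, and `doc_source` is a made-up second attribute.

```python
def register_extensions(extensions_by_class):
    """Register every listed attribute that is not registered yet.

    `extensions_by_class` maps a class with spaCy-style `has_extension` /
    `set_extension` classmethods to a dict of {attribute name: default}.
    Returns the (class name, attribute name) pairs that were newly registered.
    """
    registered = []
    for cls, defaults in extensions_by_class.items():
        for name, default in defaults.items():
            if not cls.has_extension(name):
                cls.set_extension(name, default=default)
                registered.append((cls.__name__, name))
    return registered


class StandInDoc:
    """Stand-in so the sketch runs without spaCy; mimics only the two classmethods used."""

    _extensions = {}

    @classmethod
    def has_extension(cls, name):
        return name in cls._extensions

    @classmethod
    def set_extension(cls, name, default=None):
        cls._extensions[name] = default


# Hypothetical project-wide table of custom attributes.
EXTENSIONS = {StandInDoc: {"doc_cat": None, "doc_source": None}}

first_call = register_extensions(EXTENSIONS)
second_call = register_extensions(EXTENSIONS)  # nothing left to register
print(first_call)
print(second_call)
```

With spaCy available, the same table could presumably use `Doc`, `Span`, or `Token` from `spacy.tokens` as keys, with the helper called once before any `from_disk` call.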

ines added the docs (Documentation and website) label Oct 4, 2019
ines closed this as completed in e65dffd Oct 5, 2019
lock bot commented Nov 4, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Nov 4, 2019