Multiprocessing in pipe() not working with custom attributes #4903

Closed
BramVanroy opened this issue Jan 13, 2020 · 14 comments · Fixed by #5006
Labels
bug (Bugs and behaviour differing from documentation) · compat (Cross-platform and cross-Python compatibility) · feat / doc (Feature: Doc, Span and Token objects) · scaling (Scaling, serving and parallelizing spaCy)

Comments

@BramVanroy (Contributor) commented Jan 13, 2020

I am using custom attributes and a custom pipeline component for the first time, so the mistake may well be on my part. However, the code works fine when the n_process argument is not used.

The problem: calling nlp.pipe(text, n_process=2) throws an AttributeError complaining that I am assigning a value to an unregistered extension attribute. The error is not thrown without the n_process argument.

Trace:

Process Process-1:
Traceback (most recent call last):
  File "C:\Python\Python37\Lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\Python\Python37\Lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\language.py", line 1124, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\language.py", line 1124, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "nn_parser.pyx", line 248, in pipe
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\util.py", line 481, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\language.py", line 1106, in _pipe
    doc = proc(doc, **kwargs)
  File "C:\Users\bmvroy\.PyCharm2019.2\config\scratches\scratch_33.py", line 15, in __call__
    sent._.set('my_ext', sent_ext)
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\tokens\underscore.py", line 71, in set
    return self.__setattr__(name, value)
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\tokens\underscore.py", line 63, in __setattr__
    raise AttributeError(Errors.E047.format(name=name))
AttributeError: [E047] Can't assign a value to unregistered extension attribute 'my_ext'. Did you forget to call the `set_extension` method?

My guess would be that n_process creates new processes that recreate the nlp instance but do not reinitialize its pipeline components - but that's just a guess.

How to reproduce the behaviour

import spacy
from spacy.tokens import Span, Doc

class CustomPipe:
    name = 'my_pipe'

    def __init__(self):
        Span.set_extension('my_ext', getter=self._get_my_ext)
        Doc.set_extension('my_ext', default=None)

    def __call__(self, doc):
        gathered_ext = []
        for sent in doc.sents:
            sent_ext = self._get_my_ext(sent)
            sent._.set('my_ext', sent_ext)
            gathered_ext.append(sent_ext)

        doc._.set('my_ext', '\n'.join(gathered_ext))

        return doc

    @staticmethod
    def _get_my_ext(span):
        return str(span.end)


if __name__ == '__main__':
    nlp = spacy.load('en_core_web_sm')
    custom_component = CustomPipe()
    nlp.add_pipe(custom_component, after='parser')

    text = ['I like bananas.', 'Do you like them?', 'No, I prefer wasabi.']
    # works without 'n_process'; fails with n_process=2
    for doc in nlp.pipe(text, n_process=2):
        print(doc)

Your Environment

  • spaCy version: 2.2.3
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.7.3
  • Models: en
@adrianeboyd adrianeboyd added compat Cross-platform and cross-Python compatibility feat / doc Feature: Doc, Span and Token objects usage General spaCy usage labels Jan 13, 2020
@adrianeboyd (Contributor)

I think the not-particularly-satisfying solution is to check that the extensions are registered in __call__, see: #4737 (comment)

You can check whether they're already registered (if not Token.has_extension(...)) so that on Linux you don't notice much difference.
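A minimal sketch of that workaround, based on the snippet from the OP (written against the current spaCy v3 API and using default-valued extensions throughout, so the guarded registration in __call__ is the only real change - an illustration, not spaCy's canonical pattern):

```python
from spacy.tokens import Doc, Span

class CustomPipe:
    name = 'my_pipe'

    def __call__(self, doc):
        # Register lazily inside __call__: with the 'spawn' start method,
        # worker processes re-import the module and lose any registrations
        # made in the parent, but __call__ itself runs inside the worker.
        if not Span.has_extension('my_ext'):
            Span.set_extension('my_ext', default=None)
        if not Doc.has_extension('my_ext'):
            Doc.set_extension('my_ext', default=None)
        gathered_ext = []
        for sent in doc.sents:
            sent_ext = self._get_my_ext(sent)
            sent._.set('my_ext', sent_ext)
            gathered_ext.append(sent_ext)
        doc._.set('my_ext', '\n'.join(gathered_ext))
        return doc

    @staticmethod
    def _get_my_ext(span):
        return str(span.end)
```

Because registration happens in the worker process itself, the extension exists wherever the component actually runs, regardless of start method.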

@BramVanroy (Contributor, Author) commented Jan 13, 2020

Ah yes, I thought the issue was familiar but I couldn't find it. Apparently the linked issue was closed, but I think this issue should stay open. It would be nice to get a permanent fix for this.

Not very satisfying, indeed, but if it works I'm happy with a dirty workaround for now. I'm not sure whether checking if not Token.has_extension first makes a difference, though - how would the behaviour then differ between Linux and Windows? Thanks.

@adrianeboyd (Contributor)

If they've been set before in __init__, you'll get errors unless you have force=True, so you want to check before setting them again. (I'm assuming checking with if for an existing extension is faster than setting with force=True, but I haven't actually tested it.)

@BramVanroy (Contributor, Author)

> If they've been set before in __init__, you'll get errors unless you have force=True, so you want to check before setting them again. (I'm assuming checking with if for an existing extension is faster than setting with force=True, but I haven't actually tested it.)

Seems like using force=True or manually checking if not Token.has_extension is practically the same in terms of if-checking, so I'll stick with force=True. Thanks, I didn't know that force existed.

if cls.has_extension(name) and not kwargs.get("force", False):
    raise ValueError(Errors.E090.format(name=name, obj="Token"))

I am really curious whether the underlying issue can actually be fixed, i.e. so that Token.set_extension() in __init__ also works on Windows, and what is actually causing it.
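For reference, a minimal sketch of the force=True variant (the second registration below would raise ValueError [E090] without it):

```python
from spacy.tokens import Doc

# First registration, e.g. in the component's __init__ in the parent process.
Doc.set_extension('my_ext', default=None)

# Re-registration, e.g. when a worker re-creates the component: without
# force=True this raises ValueError [E090]; with it, it simply overwrites.
Doc.set_extension('my_ext', default=None, force=True)
```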

BramVanroy pushed a commit to BramVanroy/spacy_conll that referenced this issue Jan 15, 2020
On Windows, set_extension in __init__ does not work properly when using multiprocessing. As an alternative, set them (again) in __call__. Also see explosion/spaCy#4903
@svlandeg svlandeg added the windows Issues related to Windows label Jan 17, 2020
@svlandeg (Member) commented Jan 17, 2020

Can replicate this on my system (though as a unit test, the code hangs), definitely looks like a bug on Windows.

@svlandeg svlandeg added bug Bugs and behaviour differing from documentation scaling Scaling, serving and parallelizing spaCy and removed compat Cross-platform and cross-Python compatibility usage General spaCy usage labels Jan 17, 2020
@BramVanroy (Contributor, Author) commented Jan 17, 2020

> Can replicate this on my system (though as a unit test, the code hangs), definitely looks like a bug on Windows.

Hm, that's odd. Just now I copy-pasted the snippet from my OP into a new environment: it runs fine when you leave out the n_process argument and throws the error when you include it. You are right, though, that when the error is thrown the interpreter doesn't exit. My guess is that the error is raised in a child process, and the other process(es) or the main process then just block - something along those lines.

EDIT: Oops, I was replying to your previous entry about hanging and reproducibility.

@svlandeg (Member)

Yep, I assume it's blocking on a child process, but I could still reproduce the error, so there's a chance of debugging ;-)

@adrianeboyd (Contributor)

@svlandeg: I don't think you need to do any particular Windows debugging here. You can see the same behavior on Linux if you use spawn instead of fork.

I don't see an obvious way to fix this without completely changing how custom extensions are implemented. Maybe useful warnings are possible, though?
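To see why, here is a self-contained illustration. A subprocess stands in for a 'spawn'ed worker (both start a fresh interpreter), and the REGISTRY dict is a hypothetical stand-in for spaCy's module-level extension state, not spaCy code:

```python
import subprocess
import sys

# Hypothetical stand-in for a module-level registry such as the one
# backing spaCy's custom extension attributes.
REGISTRY = {}

def child_sees_extension():
    # A fresh interpreter re-executes module top-level code, so its copy of
    # the registry starts empty no matter what the parent registered after
    # import. Under 'fork', by contrast, the child would inherit the
    # parent's memory, populated registry included.
    child_code = "REGISTRY = {}\nprint('my_ext' in REGISTRY)"
    out = subprocess.run([sys.executable, '-c', child_code],
                         capture_output=True, text=True)
    return out.stdout.strip() == 'True'

REGISTRY['my_ext'] = None     # registered in the parent only
print(child_sees_extension())  # False: the registration never reached the child
```

This is exactly what happens to extensions registered in __init__ in the parent: spawned workers never see them, hence E047.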

@svlandeg (Member)

So do I understand it correctly that it's a Windows-specific bug in spaCy because Windows defaults to spawn, and Linux to fork?

@adrianeboyd (Contributor)

I'd say that it looks like a Windows-specific bug because of the multiprocessing defaults, but it's really an issue with spawn and spaCy's global variables.

@svlandeg svlandeg added osx Issues related to macOS / OSX compat Cross-platform and cross-Python compatibility and removed osx Issues related to macOS / OSX windows Issues related to Windows labels Feb 12, 2020
@svlandeg (Member)

Ok, I looked into this some more. From what I've read, it's bad practice to rely on the current state being transferred implicitly to the workers (and that will only work on Linux, not macOS or Windows). Instead, the required state should be transferred explicitly, cf. https://docs.python.org/3/library/multiprocessing.html#programming-guidelines:

> On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.
> Apart from making the code (potentially) compatible with Windows and the other start methods this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.

Fixed in PR #5006 by specifically loading the Underscore "state".
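The pattern is roughly the following, an illustrative sketch of "transfer the required state" loosely modeled on that fix (get_state, load_state, and the EXTENSIONS dict are hypothetical names, not spaCy's actual API):

```python
import pickle

# Hypothetical module-level registry, standing in for the Underscore state.
EXTENSIONS = {}

def get_state():
    # Captured in the parent and passed to each worker as an argument.
    return pickle.dumps(EXTENSIONS)

def load_state(state):
    # Called in the worker before it processes any docs.
    EXTENSIONS.update(pickle.loads(state))

# Parent: register an extension, then capture the state to send along.
EXTENSIONS['my_ext'] = None
state = get_state()

# Worker (simulated): a spawned process re-imports the module, so its
# registry starts empty; restoring the transferred state repairs that.
EXTENSIONS.clear()
load_state(state)
print('my_ext' in EXTENSIONS)  # True
```

Because the state travels as an explicit argument rather than relying on fork's copy-on-write memory, the same code works under fork and spawn alike.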

Thanks again for the report and the helpful code snippet, @BramVanroy !

@BramVanroy (Contributor, Author)

Ah, that's great! I feared this issue was going to be ignored because "meh, Windows". I am glad it did get more attention. Thanks @svlandeg! You can close this if you want (I didn't yet because the tests in the PR failed but feel free).

@svlandeg (Member)

No that's fine, this issue will close automatically if/when the PR gets merged ;-)

lock bot commented Mar 27, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 27, 2020