Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/fix entity ruler from disk #4670

Merged
merged 7 commits into from
Nov 21, 2019

Conversation

GuiGel
Copy link
Contributor

@GuiGel GuiGel commented Nov 19, 2019

Solve a bug in the EntityRuler from_disk method

Description

issue4651

The PhraseMatcher isn't initialized correctly in the EntityRuler from_disk method when the phrase_matcher_attr attribute is specified.
This is because the from_disk function is used before the recovery of the phrase_matcher_attr attribute.
A way to solve this bug is to split the deserializer in a 2 keys dict, the patterns and the cfg. By this way we can first deserialize the cfg and initialize the PhraseMatcher with it, and secondly use the patterns part in the from_disk function, in order to add the patterns the PhraseMatcher that have been correctly initialized.

My environnement:

spaCy version: 2.2.2
Platform: Linux-4.9.0-11-amd64-x86_64-with-debian-9.11
Python version: 3.7.3

Tests run

test_issue4651_with_phrase_matcher_attr --> test that failed without modification of the EntityRuler from_disk method and work after.
test_issue4651_without_phrase_matcher_attr --> test that work all the time.

Types of change

bug fix

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd
Copy link
Contributor

Hi, thanks for submitting this PR! Your changes and tests look good, the main problems look like a few quirky things related to temporary files in the CI. (I ran into the same problems in my first PR with tests involving temporary files, it's not obvious!)

Try serializing the EntityRuler by itself (just to minimize the functionality that you're testing) and handling the temporary directories for to_disk() like this:

from ..util import make_tempdir

...
    nlp_reloaded = English()
    with make_tempdir() as d:
        file_path = d / "entityruler"
        ruler.to_disk(file_path)
        ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path)

    nlp_reloaded.add_pipe(ruler_reloaded)
    loaded_doc = nlp_reloaded(text)

You can see some examples in other tests like this one:

@pytest.mark.parametrize("Parser", test_parsers)
def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
parser = Parser(en_vocab)
parser.model, _ = parser.Model(0)
with make_tempdir() as d:
file_path = d / "parser"
parser.to_disk(file_path)
parser_d = Parser(en_vocab)
parser_d.model, _ = parser_d.Model(0)
parser_d = parser_d.from_disk(file_path)
parser_bytes = parser.to_bytes(exclude=["model", "vocab"])
parser_d_bytes = parser_d.to_bytes(exclude=["model", "vocab"])
assert parser_bytes == parser_d_bytes

@ines ines added bug Bugs and behaviour differing from documentation feat / pipeline Feature: Processing pipeline and components labels Nov 20, 2019
@ines
Copy link
Member

ines commented Nov 21, 2019

Thanks a lot! 👍 Also really appreciate the extensive tests.

@ines ines merged commit 8f7ab70 into explosion:master Nov 21, 2019
@GuiGel
Copy link
Contributor Author

GuiGel commented Nov 21, 2019

Happy to contribute a little to your great library and thanks for you help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation feat / pipeline Feature: Processing pipeline and components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants