Can't SHAPE match against strings of digits that have leading 0's. #4615

Closed
mapadofu opened this issue Nov 9, 2019 · 6 comments
Labels
docs (Documentation and website), feat / doc (Feature: Doc, Span and Token objects)

Comments

@mapadofu

mapadofu commented Nov 9, 2019

SHAPE matching doesn't work correctly against tokens made up of digits with leading zeros, e.g. 01234.

What I'd expect is that the pattern [{'SHAPE':'ddddd'}] would match against the token 01234; this is not the case. Another way to say it is that you can have a token with text=="01234" whose shape=="ddddd", but does not match that pattern.

How to reproduce the behaviour

import spacy
from spacy.matcher import Matcher

def run_test(model_name, pattern, text):
    nlp = spacy.load(model_name, disable=[])  # load the model that was actually requested

    m = Matcher(nlp.vocab)
    m.add('Serial', None, pattern)

    doc = nlp(text)

    print("Text:", text)
    print("Tokens:", list(doc))
    print("Shapes:", list(t.shape_ for t in doc))
    print("Model: {0} ({1})".format(model_name, spacy.__version__))
    print("Pattern:", pattern)
    print("Matches:", list(doc[s:t] for (i,s,t) in m(doc)))


text = "I'm writing in reference to invoice no. 01234-0001"

model_name = 'en_core_web_sm'
# same behaviour for medium and large
pattern = [{'SHAPE':'ddddd'}, {'ORTH':'-'}, {'SHAPE':'dddd'}]

run_test(model_name, pattern, text)
print("Bad")
print()
print()

pattern = [{'TEXT':{'REGEX':r"\d{5}"}}, {'ORTH':'-'}, {'TEXT':{'REGEX':r"\d{4}"}}]
run_test(model_name, pattern, text)
print("Works") 
print()

Your Environment

  • spaCy version: 2.1.8
  • Platform: Linux-4.15.0-66-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.8
@mapadofu
Author

mapadofu commented Nov 9, 2019

Another weird case:

import spacy # 2.1.8
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_md', disable=[])
text = "I'm writing in reference to invoice no. 22867-0309"
pattern = [{'SHAPE':'ddddd'}, {'ORTH':'-'}, {'SHAPE':'dddd'}]

m = Matcher(nlp.vocab)
m.add('Serial', None, pattern)

doc = nlp(text)

print("Tokens:", list([(t.text, t.shape_) for t in doc]))
print("Matches:", list(doc[s:t] for (i,s,t) in m(doc)))

produces

Tokens: [('I', 'X'), ("'m", "'x"), ('writing', 'xxxx'), ('in', 'xx'), ('reference', 'xxxx'), ('to', 'xx'), ('invoice', 'xxxx'), ('no', 'xx'), ('.', '.'), ('22867', 'dddd'), ('-', '-'), ('0309', 'dddd')]
Matches: []

Note how 22867 has a shape dddd.

@adrianeboyd adrianeboyd added bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects docs Documentation and website and removed bug Bugs and behaviour differing from documentation labels Nov 9, 2019
@adrianeboyd
Contributor

I thought this would be a bug, but it looks like it's the intended behavior. The current shape definition truncates any run of more than four characters of the same type to four in the output:

nlp("00000ddddd")[0].shape_        # 'ddddxxxx'
nlp("00000ddddd0000")[0].shape_    # 'ddddxxxxdddd'
nlp("00000dddDDdd0000")[0].shape_  # 'ddddxxxXXxxdddd'

shape is defined here:

def word_shape(text):
    if len(text) >= 100:
        return "LONG"
    shape = []
    last = ""
    shape_char = ""
    seq = 0
    for char in text:
        if char.isalpha():
            if char.isupper():
                shape_char = "X"
            else:
                shape_char = "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            seq += 1
        else:
            seq = 0
            last = shape_char
        if seq < 4:
            shape.append(shape_char)
    return "".join(shape)

So it's not related to leading zeros. This isn't obvious behavior and it should be documented properly in the Token API!
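
Because every run of four or more digits gets the shape `dddd`, one possible workaround is to pair the truncated shape with the Matcher's `LENGTH` token attribute to pin the exact digit count. The sketch below is pure Python and only builds the pattern dictionaries; `token_pattern` is a hypothetical helper (not part of spaCy), `word_shape` is copied from the definition quoted above, and the resulting pattern has not been run against the Matcher in this thread.

```python
def word_shape(text):
    # Copied from the definition quoted above.
    if len(text) >= 100:
        return "LONG"
    shape = []
    last = ""
    shape_char = ""
    seq = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            seq += 1
        else:
            seq = 0
            last = shape_char
        if seq < 4:
            shape.append(shape_char)
    return "".join(shape)


def token_pattern(example):
    """Hypothetical helper: derive a token pattern from an example string.

    SHAPE uses the truncated shape the token will really have;
    LENGTH restores the exact character count that truncation loses.
    """
    return {"SHAPE": word_shape(example), "LENGTH": len(example)}


# Pattern for serials such as "01234-0001" (split into three tokens above).
pattern = [token_pattern("01234"), {"ORTH": "-"}, token_pattern("0001")]
print(pattern)
```

Both `01234` and `22867` end up with `SHAPE` set to `dddd`, so the leading zero plays no role; only the `LENGTH` value distinguishes a five-digit token from a four-digit one.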

@honnibal
Copy link
Member

Sorry for the lack of docs, this has always been the way word shape features have behaved, to reduce sparsity. I took this definition from some paper, I think either the Ratinov et al (2009) one, or possibly an early paper from Stanford.

@mapadofu
Author

mapadofu commented Nov 23, 2019

Note that the German phone number example (2nd code example below https://spacy.io/usage/rule-based-matching#example2 ) shouldn't work since it uses {"SHAPE": "dddddd"} (I haven't checked myself yet)
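
A quick way to check whether a `SHAPE` value in a pattern can ever equal a real token shape is to cap its runs the same way `word_shape` does. This is a minimal sketch, assuming the four-character cap described earlier in this thread; `truncate_runs` is a hypothetical helper, not a spaCy function.

```python
from itertools import groupby


def truncate_runs(shape, max_run=4):
    """Cap each run of identical shape characters at max_run characters."""
    return "".join(
        char * min(len(list(run)), max_run) for char, run in groupby(shape)
    )


for candidate in ["dddddd", "ddddd", "dddd", "Xxxxxx-dd"]:
    capped = truncate_runs(candidate)
    # A shape that changes when capped can never be produced by word_shape.
    print(candidate, "->", capped, "(reachable)" if capped == candidate else "(unreachable)")
```

Under that assumption, `dddddd` is capped to `dddd`, so a pattern written with six `d` characters would not match any token's shape.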

@honnibal
Member

honnibal commented Nov 23, 2019

A full description would be something like:

Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by "x" or "X", numeric characters are replaced by "d", and sequences of the same character are truncated after length 4.

@ines ines closed this as completed in cbacb0f Nov 23, 2019