Can't SHAPE match against strings of digits that have leading 0's. #4615

mapadofu · 2019-11-09T00:56:35Z

SHAPE matching against tokens made up of digits with leading zeros, e.g. 01234, doesn't work correctly.

What I'd expect is that the pattern [{'SHAPE':'ddddd'}] would match the token 01234; this is not the case. Put another way, you can have a token with text=="01234" whose shape=="ddddd", but which does not match that pattern.

How to reproduce the behaviour

import spacy
from spacy.matcher import Matcher

def run_test(model_name, pattern, text):
    nlp = spacy.load(model_name, disable=[])

    m = Matcher(nlp.vocab)
    m.add('Serial', None, pattern)

    doc = nlp(text)

    print("Text:", text)
    print("Tokens:", list(doc))
    print("Shapes:", list(t.shape_ for t in doc))
    print("Model: {0} ({1})".format(model_name, spacy.__version__))
    print("Pattern:", pattern)
    print("Matches:", list(doc[s:t] for (i,s,t) in m(doc)))


text = "I'm writing in reference to invoice no. 01234-0001"

model_name = 'en_core_web_sm'
# same behaviour for medium and large
pattern = [{'SHAPE':'ddddd'}, {'ORTH':'-'}, {'SHAPE':'dddd'}]

run_test(model_name, pattern, text)
print("Bad")
print()
print()

pattern = [{'TEXT':{'REGEX':r"\d{5}"}}, {'ORTH':'-'}, {'TEXT':{'REGEX':r"\d{4}"}}]
run_test(model_name, pattern, text)
print("Works") 
print()
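One caveat about the regex workaround: `\d{5}` has no anchors. Whether that matters depends on how the matcher applies the pattern to the token text; if it uses search semantics (an assumption here, not something confirmed in this thread), tokens with more than five digits would match as well. The plain `re` sketch below only illustrates the difference between the unanchored and anchored forms and makes no claim about spaCy's internals.

```python
import re

# The pattern from the report, and an anchored variant for comparison.
unanchored = re.compile(r"\d{5}")
anchored = re.compile(r"^\d{5}$")

for text in ["01234", "22867", "123456", "0001"]:
    # search() succeeds if the pattern occurs anywhere in the string.
    print(text, bool(unanchored.search(text)), bool(anchored.search(text)))
```

If exact token length matters, the anchored form is the safer choice under either matching semantics.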

Your Environment

spaCy version: 2.1.8
Platform: Linux-4.15.0-66-generic-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.8

mapadofu · 2019-11-09T01:42:23Z

Another weird case:

import spacy # 2.1.8
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_md', disable=[])
text = "I'm writing in reference to invoice no. 22867-0309"
pattern = [{'SHAPE':'ddddd'}, {'ORTH':'-'}, {'SHAPE':'dddd'}]

m = Matcher(nlp.vocab)
m.add('Serial', None, pattern)

doc = nlp(text)

print("Tokens:", list([(t.text, t.shape_) for t in doc]))
print("Matches:", list(doc[s:t] for (i,s,t) in m(doc)))

produces

Tokens: [('I', 'X'), ("'m", "'x"), ('writing', 'xxxx'), ('in', 'xx'), ('reference', 'xxxx'), ('to', 'xx'), ('invoice', 'xxxx'), ('no', 'xx'), ('.', '.'), ('22867', 'dddd'), ('-', '-'), ('0309', 'dddd')]
Matches: []

Note how 22867 has a shape dddd.

adrianeboyd · 2019-11-11T07:52:50Z

I thought this would be a bug, but it looks like it's the intended behavior. The current shape definition caps any run of more than four characters of the same type at four in the output:

nlp("00000ddddd")[0].shape_        # 'ddddxxxx'
nlp("00000ddddd0000")[0].shape_    # 'ddddxxxxdddd'
nlp("00000dddDDdd0000")[0].shape_  # 'ddddxxxXXxxdddd'

shape is defined here:

spaCy/spacy/lang/lex_attrs.py

Lines 150 to 174 in 4d85f67

def word_shape(text):
    if len(text) >= 100:
        return "LONG"
    shape = []
    last = ""
    shape_char = ""
    seq = 0
    for char in text:
        if char.isalpha():
            if char.isupper():
                shape_char = "X"
            else:
                shape_char = "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            seq += 1
        else:
            seq = 0
            last = shape_char
        if seq < 4:
            shape.append(shape_char)
    return "".join(shape)

So it's not related to leading zeros. This isn't obvious behavior and it should be documented properly in the Token API!
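To see why the tokens from the report end up with the same shape, here is a minimal standalone sketch of the capping rule. It is not spaCy's implementation: `char_class` and `capped_shape` are hypothetical names, and the default `max_run=4` is taken from the `seq < 4` check in `word_shape`.

```python
from itertools import groupby


def char_class(char):
    """Map one character to its shape symbol (hypothetical helper)."""
    if char.isalpha():
        return "X" if char.isupper() else "x"
    if char.isdigit():
        return "d"
    return char


def capped_shape(text, max_run=4):
    """Sketch of the shape rule: each run of one symbol is cut off at max_run."""
    if len(text) >= 100:
        return "LONG"
    symbols = (char_class(c) for c in text)
    # groupby collects consecutive identical symbols into runs.
    return "".join(sym * min(len(list(run)), max_run) for sym, run in groupby(symbols))


for token in ["01234", "22867", "0309", "00000dddDDdd0000"]:
    print(token, capped_shape(token))
```

Under this rule every all-digit token of length four or more gets the shape `dddd`, so a `ddddd` pattern can never equal a token's shape.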

honnibal · 2019-11-23T14:16:06Z

Sorry for the lack of docs. This is how word shape features have always behaved, to reduce sparsity. I took this definition from some paper, I think either the Ratinov et al (2009) one, or possibly an early paper from Stanford.

mapadofu · 2019-11-23T15:30:18Z

Note that the German phone number example (the second code example below https://spacy.io/usage/rule-based-matching#example2) shouldn't work, since it uses {"SHAPE": "dddddd"}. (I haven't checked this myself yet.)

honnibal · 2019-11-23T16:02:49Z

A full description would be something like:

Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by "x" or "X", and numeric characters are replaced by "d", and sequences of the same character are truncated after length 4.
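Given that description, one way to keep SHAPE patterns consistent with the truncation is to cap the pattern string the same way before passing it to the matcher. A minimal sketch, assuming a hypothetical helper named `cap_shape_pattern` that is not part of spaCy:

```python
import re

# A run of five or more identical characters: one character plus 4+ repeats.
_LONG_RUN = re.compile(r"(.)\1{4,}")


def cap_shape_pattern(shape):
    """Truncate every run of identical shape characters to length 4."""
    return _LONG_RUN.sub(lambda m: m.group(1) * 4, shape)


print(cap_shape_pattern("ddddd"))    # dddd
print(cap_shape_pattern("dddddd"))   # dddd
print(cap_shape_pattern("Xxxxxxx"))  # Xxxxx
print(cap_shape_pattern("dd-dddd"))  # dd-dddd
```

A capped pattern such as `dddd` matches digit tokens of any length from four upwards, so if the exact digit count matters it probably needs to be combined with a length constraint (for example the matcher's `LENGTH` attribute) or an anchored regex on the token text.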

lock · 2019-12-23T17:47:31Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

adrianeboyd added the labels bug (Bugs and behaviour differing from documentation), feat / doc (Feature: Doc, Span and Token objects) and docs (Documentation and website), and removed the label bug (Bugs and behaviour differing from documentation) Nov 9, 2019

ines closed this as completed in cbacb0f Nov 23, 2019

lock bot locked as resolved and limited conversation to collaborators Dec 23, 2019
