Improve docs on phrase pattern attributes (closes #4100) [ci skip]
ines committed Aug 11, 2019
1 parent 1f4d8bf commit 1362f79
Showing 1 changed file with 19 additions and 5 deletions.
`website/docs/usage/rule-based-matching.md` (19 additions, 5 deletions)
```diff
@@ -788,11 +788,11 @@ token pattern covering the exact tokenization of the term.
 To create the patterns, each phrase has to be processed with the `nlp` object.
 If you have a model loaded, doing this in a loop or list comprehension can easily
-become inefficient and slow. If you only need the tokenization and lexical
-attributes, you can run [`nlp.make_doc`](/api/language#make_doc) instead, which
-will only run the tokenizer. For an additional speed boost, you can also use the
-[`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will process the texts
-as a stream.
+become inefficient and slow. If you **only need the tokenization and lexical
+attributes**, you can run [`nlp.make_doc`](/api/language#make_doc) instead,
+which will only run the tokenizer. For an additional speed boost, you can also
+use the [`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will process
+the texts as a stream.
```
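The recommendation in the changed paragraph can be sketched end to end. This is an illustrative snippet, not part of the commit: the term list is made up, and `spacy.blank("en")` stands in for whichever pipeline you have loaded.

```python
import spacy

# Illustrative setup: a blank English pipeline. Any loaded model behaves the
# same for this step, since only the tokenizer is needed.
nlp = spacy.blank("en")

LOTS_OF_TERMS = ["Barack Obama", "Angela Merkel", "Washington, D.C."]

# Slow: runs every pipeline component on every term
patterns = [nlp(term) for term in LOTS_OF_TERMS]

# Faster: nlp.make_doc only runs the tokenizer
patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]

# Faster still: tokenize the texts as a stream
patterns = list(nlp.tokenizer.pipe(LOTS_OF_TERMS))

print([t.text for t in patterns[0]])  # ['Barack', 'Obama']
```

All three variants produce equivalent `Doc` pattern objects here; they differ only in how much work is done per term.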
```diff
- patterns = [nlp(term) for term in LOTS_OF_TERMS]
+ patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]
```
````diff
@@ -825,6 +825,20 @@ for match_id, start, end in matcher(doc):
     print("Matched based on lowercase token text:", doc[start:end])
 ```

+<Infobox title="Important note on creating patterns" variant="warning">
+
+The examples here use [`nlp.make_doc`](/api/language#make_doc) to create `Doc`
+object patterns as efficiently as possible and without running any of the other
+pipeline components. If the token attributes you want to match on are set by a
+pipeline component, **make sure that the pipeline component runs** when you
+create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
+objects need to have part-of-speech tags set by the `tagger`. You can either
+call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use
+[`nlp.disable_pipes`](/api/language#disable_pipes) to disable components
+selectively.
+
+</Infobox>
+
````
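The warning in the added infobox can be demonstrated with a minimal sketch (not part of the commit; the example text is invented, and a blank pipeline stands in for one without a tagger):

```python
import spacy

# Illustrative: a blank pipeline has a tokenizer but no tagger/parser/NER
nlp = spacy.blank("en")

doc = nlp.make_doc("Angela Merkel")

# Lexical attributes are set by the tokenizer alone, so LOWER or SHAPE
# patterns created with nlp.make_doc work fine:
print([t.lower_ for t in doc])  # ['angela', 'merkel']

# Attributes set by pipeline components are missing: the POS tags here are
# empty strings, so a POS or LEMMA pattern built this way would never match.
print([t.pos_ for t in doc])
```

With a real model loaded, you would instead call `nlp` on the pattern texts, or use `nlp.disable_pipes` to disable everything except the components you need.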
Another possible use case is matching number tokens like IP addresses based on
their shape. This means that you won't have to worry about how those strings will
be tokenized and you'll be able to find tokens and combinations of tokens based
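The shape-based idea in the final paragraph relies only on lexical attributes, so it can be illustrated without a model (a sketch, not part of the commit; the sample sentence is invented):

```python
import spacy

nlp = spacy.blank("en")

doc = nlp.make_doc("The router address is 192.168.1.1")

# SHAPE is a lexical attribute set by the tokenizer: digits become "d", so an
# IP address gets a shape like "ddd.ddd.d.d" that a token pattern such as
# {"SHAPE": "ddd.ddd.d.d"} can match on, regardless of the exact numbers.
print([(t.text, t.shape_) for t in doc])
```

Because the shape is computed at tokenization time, `nlp.make_doc` is sufficient here too.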