
Correcting tokenization errors by parser #3818

Closed
hiroshi-matsuda-rit opened this issue Jun 4, 2019 · 7 comments
Labels: feat / parser (Feature: Dependency Parser), usage (General spaCy usage)


@hiroshi-matsuda-rit (Contributor) commented Jun 4, 2019

Feature description

I'd like to implement functionality that corrects tokenization errors (both boundaries and tags) via the parser. With this error correction function, our Japanese language model will be able to resolve ambiguous POS tags (such as サ変名詞 as NOUN or VERB) and merge over-segmented tokens.

I found a related mention in the v2.1 release notes:

Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.

Could you please give me links to the source code that does this? @honnibal
Can we apply "joint word segmentation and parsing" to a single (and possibly root) token span?

@BreakBB (Contributor) commented Jun 4, 2019

Have you had a look at retokenization in spaCy?

It allows you to update the attributes of tokens, such as POS, via the retokenizer's `merge` method.
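The retokenizer pattern BreakBB mentions can be illustrated with a toy sketch. This is plain Python, not the spaCy API (in spaCy you would use `with doc.retokenize() as retokenizer: retokenizer.merge(span, attrs={...})` on a real `Doc`); here tokens are just dicts, and merging a span joins the texts and overrides the POS:

```python
# Toy illustration of what a retokenizer merge conceptually does:
# tokens are plain dicts here, not spaCy Token objects.

def merge_span(tokens, start, end, attrs):
    """Merge tokens[start:end] into one token, applying attrs
    (e.g. a corrected POS) to the merged token."""
    merged = {
        "text": "".join(t["text"] for t in tokens[start:end]),
        "pos": attrs.get("pos", tokens[start]["pos"]),
    }
    return tokens[:start] + [merged] + tokens[end:]

# An over-segmented noun + suru-verb stem, merged and retagged as a VERB:
tokens = [{"text": "勉強", "pos": "NOUN"}, {"text": "する", "pos": "AUX"}]
tokens = merge_span(tokens, 0, 2, {"pos": "VERB"})
```

The real retokenizer does much more (fixes up heads, dependency arcs, and spans), but the attribute-override behavior is the part relevant to POS correction.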

@hiroshi-matsuda-rit (Contributor, Author) commented Jun 4, 2019

Sure, I've been using the retokenization APIs.

In GiNZA, I use a logic based on extended dependency labels, e.g. "obj_as_NOUN", to distinguish ambiguous POS tags, and it also appends a virtual root token after the last token of each sentence to distinguish the POS of the real root token, e.g. "root_as_VERB".

https:/megagonlabs/ginza/blob/develop/ja_ginza/parse_tree.py#L445
https:/megagonlabs/ginza/blob/feature/apply_spacy_v2.1/ja_ginza/parse_tree.py#L433

This tricky logic is quite complicated and also hurts performance, so I'd like to refactor it.
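The extended-label trick described above can be sketched as a pair of encode/decode helpers (hypothetical names, not the actual GiNZA code, which lives in the parse_tree.py files linked above):

```python
# Sketch of smuggling a POS hint through a dependency label:
# "obj_as_NOUN" -> dep="obj", pos="NOUN", recovered after parsing.

def encode_label(dep, pos):
    return f"{dep}_as_{pos}"

def decode_label(label):
    if "_as_" in label:
        dep, pos = label.rsplit("_as_", 1)
        return dep, pos
    return label, None  # plain label, no POS hint encoded
```

The cost of this scheme is a larger label set for the parser to learn, which is part of why it reduces performance.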

Thanks, @BreakBB

@honnibal (Member) commented Jun 7, 2019

@BreakBB Actually this refers to the parser-based mechanism, which uses the subtok label. It is a bit different from retokenization.

@hiroshi-matsuda-rit In the command-line interface, it should be as simple as adding --learn-tokens. The mechanism works like this:

  • In the GoldParse class, we receive a pair (doc, annotations), where the annotations includes the gold-standard segmentation, and the doc object contains the predicted tokenization. We then do a Levenshtein alignment between the two. The alignment is called in spacy/gold.pyx, and the main logic is in spacy/_align.pyx.
  • When the predicted tokenization over-segments, we set the gold-standard label on the non-final tokens of an over-segmented region to the special label subtok. The head for these subtok tokens will be the next word. This occurs in spacy/gold.pyx.
  • The parser then learns to predict these subtok labels. Additional constraints on this label ensure that the parser can only predict subtok for length-1 arcs, and that subtokens cannot cross sentence boundaries.
  • After parsing, the subtokens are merged using doc.retokenize(). This should be occurring in the merge_subtokens pipeline component in v2.1.4. In the next release, this will be moved into parser.postprocesses, to make the system more self-contained.
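The steps above can be sketched in plain Python. This is a simplified stand-in for the logic in spacy/gold.pyx and the merge_subtokens component, not the real implementation (spaCy works on Doc objects and uses a full Levenshtein alignment; here we only handle the clean over-segmentation case, where predicted tokens concatenate, in order, into gold words):

```python
# Label all non-final pieces of each over-segmented gold word "subtok"
# (their head would be the next token), then merge contiguous subtok
# runs back into single tokens after "parsing".

def label_subtoks(pred_tokens, gold_words):
    """Assumes pred_tokens concatenate exactly into gold_words."""
    labels = []
    i = 0
    for word in gold_words:
        piece = ""
        while piece != word:
            piece += pred_tokens[i]
            # non-final piece of this word -> "subtok"
            labels.append("subtok" if piece != word else None)
            i += 1
    return labels

def merge_subtoks(tokens, labels):
    merged, buf = [], ""
    for tok, lab in zip(tokens, labels):
        buf += tok
        if lab != "subtok":  # final piece of a word: flush the buffer
            merged.append(buf)
            buf = ""
    return merged

pred = ["some", "thing", "is", "up"]
gold = ["something", "is", "up"]
labels = label_subtoks(pred, gold)  # ["subtok", None, None, None]
```

In the real system the "None" slots carry ordinary dependency labels predicted by the parser, and the length-1-arc constraint mentioned above is what keeps subtok runs contiguous.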

It sounds to me like your system would benefit from having several ROOT labels, which could be interpreted with different meanings. Currently the ROOT label is hard-coded, which prevents this.

@ines added the labels feat / parser (Feature: Dependency Parser) and usage (General spaCy usage) Jun 8, 2019
@BreakBB (Contributor) commented Jun 11, 2019

@honnibal I simply shared what I have found in the docs. Thanks for the clarification!

@hiroshi-matsuda-rit (Contributor, Author) commented Jun 11, 2019

@honnibal Thank you so much for your precise description of the subtok concatenation procedure.
I decided to replace GiNZA's POS disambiguation and retokenization procedures with spaCy's POS tagger and --learn-tokens, respectively.
spaCy's train command works well with the -G option but does not work with SudachiTokenizer (without -G).
It seems we should retokenize the dataset with the tokenizer in advance to avoid the inconsistency.
I encounter an error at the beginning of the first evaluation phase (just after the first training phase).

python -m spacy train ja ja_gsd-ud ja_gsd-ud-train.json ja_gsd-ud-dev.json -p tagger,parser -ne 2 -V 1.2.2 -pt dep,tag -v models/ja_gsd-1.2.1/ -VV
...
✔ Saved model to output directory                                                                                                                                                         
ja_gsd-ud/model-final
⠙ Creating best model...
Traceback (most recent call last):
  File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/language.py", line 457, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 413, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 519, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "transition_system.pyx", line 86, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
  File "arc_eager.pyx", line 592, in spacy.syntax.arc_eager.ArcEager.set_costs
ValueError: [E020] Could not find a gold-standard action to supervise the dependency parser. The tree is non-projective (i.e. it has crossing arcs - see spacy/syntax/nonproj.pyx for definitions). The ArcEager transition system only supports projective trees. To learn non-projective representations, transform the data before training and after parsing. Either pass `make_projective=True` to the GoldParse class, or use spacy.syntax.nonproj.preprocess_training_data.
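For reference, the non-projectivity that error [E020] complains about can be checked with a small stand-alone function. This is a sketch of the definition only (spaCy's own implementation lives in spacy/syntax/nonproj.pyx); a tree is projective iff no two dependency arcs cross:

```python
# heads[i] is the index of token i's head (heads[i] == i for the root).

def is_projective(heads):
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for (l1, r1) in arcs:
        for (l2, r2) in arcs:
            # two arcs cross when exactly one endpoint of one arc
            # lies strictly inside the span of the other
            if l1 < l2 < r1 < r2:
                return False
    return True
```

The error message offers the two standard remedies: projectivize the training data up front (`spacy.syntax.nonproj.preprocess_training_data`) or pass `make_projective=True` to `GoldParse`.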

I'll report how I solved this problem soon.

@hiroshi-matsuda-rit (Contributor, Author) commented Jun 11, 2019

Anyway, I think many applications around the world would be happy if they could use customized root labels.

This was referenced Jun 11, 2019
@ines ines closed this as completed Sep 12, 2019
lock bot commented Oct 12, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked this as resolved and limited conversation to collaborators Oct 12, 2019