Correcting tokenization errors by parser #3818
Comments
Have you had a look at retokenization in spaCy? That allows you to update the attributes of tokens, such as POS, using `retokenizer.merge` |
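To make the suggestion above concrete, here is a minimal sketch of spaCy's retokenization API: `Doc.retokenize()` yields a retokenizer whose `merge` method fuses a span into one token and can overwrite attributes such as POS in the same step. A blank English pipeline is used so the snippet runs without downloading a model; the span and attributes are illustrative.

```python
import spacy

nlp = spacy.blank("en")  # blank pipeline: no trained model required
doc = nlp("New York is big")

# Merge the over-segmented span "New York" into a single token and
# overwrite its POS attribute in the same retokenization step.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2], attrs={"POS": "PROPN"})

print([(t.text, t.pos_) for t in doc])
```

Because all merges inside one `with doc.retokenize()` block are applied together when the block exits, token indices recorded before the block stay valid while you queue up merges.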
Sure. I've been using the retokenization APIs. In GiNZA, I'm using a logic based on extended dependency labels, e.g. "obj_as_NOUN", to distinguish ambiguous POS, and it also uses a virtual root token appended after the last token in the sentence to distinguish the POS of the real root token, e.g. "root_as_VERB". https://github.com/megagonlabs/ginza/blob/develop/ja_ginza/parse_tree.py#L445 This tricky logic is quite complicated and also reduces performance. Thanks, @BreakBB |
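The label-encoding trick described above can be sketched in a few lines: the gold POS of an ambiguous token is packed into its dependency label for training ("obj_as_NOUN"), then split back out after parsing. The function names and the "_as_" separator here are illustrative assumptions, not GiNZA's actual API.

```python
from typing import Optional, Tuple

# Hypothetical helpers mirroring the extended-label scheme described
# in the comment above; "_as_" is the assumed separator.

def encode_label(dep: str, pos: str) -> str:
    """Pack a POS hint into a dependency label, e.g. obj -> obj_as_NOUN."""
    return f"{dep}_as_{pos}"

def decode_label(label: str) -> Tuple[str, Optional[str]]:
    """Recover (dep, pos); labels without the suffix carry no POS hint."""
    if "_as_" in label:
        dep, pos = label.rsplit("_as_", 1)
        return dep, pos
    return label, None
```

The cost the comment mentions is visible here: every (dep, POS) combination becomes a distinct parser label, inflating the label set and slowing the parser.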
@BreakBB Actually this refers to the parser-based mechanism, which uses the |
@hiroshi-matsuda-rit In the command-line interface, it should be as simple as adding |
It sounds to me like your system would benefit from having several |
@honnibal I simply shared what I have found in the docs. Thanks for the clarification! |
@honnibal Thank you so much for your precise description of the subtok concatenation procedure.
I'd like to report soon on how I solve this problem. |
Anyway, I think that many applications around the world would be happy if they could use customized root labels. |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Feature description
I'd like to implement functionality that can correct tokenization errors (both boundaries and tags) using the parser. With this error-correction function, our Japanese language model will be able to resolve ambiguous POS tags (such as サ変名詞 as NOUN or VERB) and merge over-segmented tokens.
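One way to realize the over-segmentation half of this feature, sketched under assumptions: the parser labels each fragment of a split word with a "subtok" dependency, and a post-processing pass merges each run of subtok tokens with the head token that follows. The dependency labels below are set by hand to stand in for parser output, and the segmentation example is contrived.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Build a doc whose first token is an over-segmented fragment:
# "to" + "kenization" should really be one token, "tokenization".
doc = Doc(nlp.vocab, words=["to", "kenization", "errors"],
          spaces=[False, True, False])

# Stand-in for parser output: fragments get the "subtok" label.
doc[0].dep_ = "subtok"
doc[1].dep_ = "ROOT"
doc[2].dep_ = "dep"

# Collect each run of subtok tokens plus the token that follows it,
# then merge them all in a single retokenization pass.
spans = []
i = 0
while i < len(doc):
    if doc[i].dep_ == "subtok":
        j = i
        while j < len(doc) and doc[j].dep_ == "subtok":
            j += 1
        spans.append(doc[i : j + 1])  # fragments + following token
        i = j + 1
    else:
        i += 1

with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([t.text for t in doc])
```

This is only a sketch of the idea; the real value of a parser-based scheme is that the parser learns where to place the subtok labels, rather than having them assigned by hand as here.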
I found a related mention in the v2.1 release notes.
Could you please give me links to the source code that does this? @honnibal
Can we apply "joint word segmentation and parsing" to a single (and possibly root) token span?