Split one token into several #2838
Comments
Here are some implementation notes and questions.
Does that sound like it would work? I can implement it, if so.
Merging tokens is simpler than splitting them because we can just iron out the internal structure of the span to be merged, i.e. drop the internal dependencies and redirect all other incoming and outgoing dependencies to the syntactic head (provided the merged tokens form a proper subtree).
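To make the merge case concrete, here is a minimal sketch in plain Python, independent of spaCy's internals. The function name `merge_span`, the list-of-absolute-head-indices representation, and the convention that the root points to itself are all assumptions made for illustration, not the retokenizer's actual data structures.

```python
def merge_span(words, heads, start, end):
    """Merge tokens [start, end) into one token (hypothetical helper).

    `heads[i]` is the absolute index of token i's head; the root points to
    itself. The span must form a proper subtree, i.e. exactly one token in
    it has a head outside the span (or is the root).
    """
    if not 0 <= start < end <= len(words):
        raise ValueError("invalid span")
    span = range(start, end)
    tops = [i for i in span if heads[i] == i or heads[i] not in span]
    if len(tops) != 1:
        raise ValueError("span is not a proper subtree")
    span_head = tops[0]
    shrink = end - start - 1  # number of token positions that disappear

    def remap(old):
        # Map an index in the old document to the merged document.
        if old < start:
            return old
        if old < end:
            return start  # anything inside the span collapses to one token
        return old - shrink

    new_words = words[:start] + [" ".join(words[start:end])] + words[end:]
    new_heads = [remap(h) for h in heads[:start]]
    # Internal edges are dropped; the merged token keeps the span head's edge.
    new_heads.append(remap(heads[span_head]))
    new_heads.extend(remap(h) for h in heads[end:])
    return new_words, new_heads


# "New" attaches to "York", "York" attaches to "in".
words, heads = merge_span(["I", "live", "in", "New", "York"], [1, 1, 1, 4, 2], 3, 5)
# words == ["I", "live", "in", "New York"], heads == [1, 1, 1, 2]
```

Dependency labels are left out here; in this sketch the merged token would simply inherit the label of the span head.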
The reverse operation (splitting a token) is trickier because we need to create an internal structure between the newly split tokens and stitch them to the surrounding tokens. The implementation you propose effectively imposes a fixed topology, where one of the new tokens is the head of all the other new tokens and receives all incoming and outgoing dependency edges. I assume the (new) internal dependency edges would have a default label.
I think the main issue here is the expected use case. Would the user know the complex dependency structure of the split tokens in advance, or would they reparse, or use a simpler structure most of the time? It could also make sense not to touch the dependencies at all, and to assume the document is either not parsed yet or will get reparsed afterwards.
@grivaz Sorry for the delay getting to this. Definitely appreciate the help. I tried to implement token splitting during parsing earlier this year, as I wanted the parser to handle languages like Arabic by jointly predicting the tokenization. I wanted an action which would divide the second token of the buffer. This ended up being a huge mess, so I backed out the changes. Introducing a split method that works prior to parsing is much more feasible. For the parse tree, I think it would make sense to take array-valued arguments for the heads and dep labels. For instance, if we had something like `split(doc, 4, 3, heads=[1, 0, -1])`, the head of the new region would be one of the introduced tokens. To make that work, we'll need to repoint the heads that were going into the region, so that they point to this new head. Otherwise, what you suggest is good. The only thing I'd note is:
The capacity is in |
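The array-valued `heads` idea can be sketched in plain Python as follows. This is only one possible reading of the proposal: it assumes that each entry of `rel_heads` is an offset relative to the new token's own position, that exactly one entry is `0` and marks the head of the new region, and that heads are stored as absolute indices with the root pointing to itself. The name `split_token` and its signature are hypothetical and are not the retokenizer API.

```python
def split_token(words, heads, index, pieces, rel_heads):
    """Split token `index` into `pieces` (hypothetical helper).

    `rel_heads[k]` is the offset from new token k to its head inside the new
    region; the single 0 entry marks the region head, which inherits the
    original token's head. Cycle checks among the new tokens are omitted.
    """
    if len(pieces) != len(rel_heads):
        raise ValueError("pieces and rel_heads must have the same length")
    if rel_heads.count(0) != 1:
        raise ValueError("exactly one new token must be the region head")
    n = len(pieces)
    root_offset = rel_heads.index(0)

    def remap(old):
        # Map an index in the old document to the split document.
        if old < index:
            return old
        if old == index:
            return index + root_offset  # repoint incoming edges to the new head
        return old + n - 1

    new_words = words[:index] + list(pieces) + words[index + 1:]
    new_heads = [remap(h) for h in heads[:index]]
    for k, rel in enumerate(rel_heads):
        if rel == 0:
            # Region head takes over the original token's outgoing edge.
            new_heads.append(remap(heads[index]))
        else:
            if not 0 <= k + rel < n:
                raise ValueError("relative head points outside the new region")
            new_heads.append(index + k + rel)
    new_heads.extend(remap(h) for h in heads[index + 1:])
    return new_words, new_heads


# Split "NewYork" (attached to "in") so that "New" attaches to "York".
words, heads = split_token(["I", "live", "in", "NewYork"], [1, 1, 1, 2], 3, ["New", "York"], [1, 0])
# words == ["I", "live", "in", "New", "York"], heads == [1, 1, 1, 4, 2]
```

The fixed topology discussed earlier in the thread is the special case where every non-zero offset points at the same new token. Dependency labels could be passed as a parallel array in the same way.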
Feature description
We now have a feature to merge tokens into one, implemented in the retokenizer. We're lacking the reverse feature: splitting one token into several. An API exists in _retokenize, but it is not implemented yet.