
Split one token into several #2838

Closed
grivaz opened this issue Oct 10, 2018 · 5 comments
Labels
enhancement (Feature requests and improvements) · feat / doc (Feature: Doc, Span and Token objects)

Comments

@grivaz
Contributor

grivaz commented Oct 10, 2018

Feature description

We now have a feature to merge several tokens into one, implemented in the retokenizer. We're lacking the reverse feature: splitting one token into several. An API exists in `_retokenize`, but it is not implemented yet.
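
As a rough illustration of the symmetry between the two operations, here is a pure-Python sketch with tokens modeled as plain strings. The function names are illustrative, not spaCy's API, and this ignores dependencies and attributes entirely:

```python
def merge_tokens(tokens, start, end):
    """Collapse tokens[start:end] into a single token (the existing feature)."""
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]

def split_token(tokens, i, parts):
    """Replace tokens[i] with several new tokens (the missing feature)."""
    assert "".join(parts) == tokens[i], "parts must re-spell the original token"
    return tokens[:i] + list(parts) + tokens[i + 1:]

tokens = ["I", "visited", "NewYork"]
print(split_token(tokens, 2, ["New", "York"]))  # ['I', 'visited', 'New', 'York']
```

The hard part, as the comments below discuss, is everything this sketch leaves out: heads, dependency labels, NER tags, and whitespace flags.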

@grivaz
Contributor Author

grivaz commented Oct 10, 2018

Here are some implementation notes and questions.

  • The current API doesn't specify any syntactic dependencies for the new tokens. This raises two issues:
  1. Where should dependencies that pointed to the original token point after the split? Passing the root of the new span as a parameter would let tokens that pointed to the original token point to the root of the newly created span.
  2. There is no way of knowing the dependencies of the newly created tokens (that I can think of). Here too, a root argument would make it possible to assign that root as the head of all other newly created tokens, and to keep the original dependencies for the root token itself.
  • Here is how I would implement the split, given a root argument:
    Double the doc.c allocation if necessary (until it is big enough for all new tokens)
    Move the tokens after the split point to make space for the new tokens
    Write the new tokens into the newly created space
    Get a LexemeC* for each new orth
    Set token.spacy to False for all but the last split token, and to origToken.spacy for the last one
    Apply attrs to each subtoken
    If origToken.iob == 3 (begin), set the first subtoken to 3 (begin) and all other subtokens to 1 (in)
    In all other cases, subtokens inherit iob from origToken
    Adjust all heads by the offset, similar to the merge function
    Heads that pointed to the original token now point to the root
    All non-root subtokens have the root token as head
    The root subtoken inherits its dependencies from origToken
    Set children from heads

Does that sound like it would work? I can implement it, if so.
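
The steps above might be sketched in plain Python roughly as follows. `Tok` and `split_with_root` are hypothetical stand-ins for spaCy's C-level structures (this ignores the doc.c allocation steps), and absolute head indices are used instead of spaCy's head offsets:

```python
from dataclasses import dataclass, replace

@dataclass
class Tok:
    orth: str
    head: int    # absolute index of the syntactic head (self-index = sentence root)
    spacy: bool  # trailing-whitespace flag
    iob: int     # NER tag: 3 = begin, 1 = in, 2 = out

def split_with_root(tokens, i, orths, root):
    """Split tokens[i] into len(orths) subtokens. `root` is the index,
    within the split, of the subtoken that keeps the original attachment."""
    old = tokens[i]
    n_new = len(orths) - 1  # number of extra slots needed
    new = []
    for j, orth in enumerate(orths):
        if j == root:
            if old.head == i:        # original token was the sentence root
                head = i + root
            elif old.head > i:       # outer head sits after the split point
                head = old.head + n_new
            else:
                head = old.head
        else:                        # non-root subtokens attach to the root
            head = i + root
        spacy = old.spacy if j == len(orths) - 1 else False
        iob = old.iob if (j == 0 or old.iob != 3) else 1
        new.append(Tok(orth, head, spacy, iob))
    out = tokens[:i] + new + tokens[i + 1:]
    for k, tok in enumerate(out):
        if i <= k < i + len(orths):
            continue                 # subtokens were set above
        if tok.head == i:            # repoint heads that targeted the old token
            out[k] = replace(tok, head=i + root)
        elif tok.head > i:           # shift heads that pointed past the split
            out[k] = replace(tok, head=tok.head + n_new)
    return out
```

For example, splitting "NewYork" (head "visited", iob begin) into ["New", "York"] with root=0 keeps "New" attached to "visited", attaches "York" to "New", and gives the subtokens iob tags begin and in respectively.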

@moreymat
Contributor

Merging tokens is simpler than splitting them because we can just iron out the internal structure of the span to be merged, i.e., drop internal dependencies and redirect all other incoming and outgoing dependencies to the syntactic head (provided the merged tokens form a proper subtree).
Note that:

  1. The dropped internal dependencies can form different topologies, including but not restricted to the one where all merged tokens are direct dependents of a unique head within the span;
  2. Tokens other than the head can also have dependents on their own, which is fine because they become dependents of the merged token.

The reverse operation (splitting a token) is trickier because we need to create an internal structure between the newly split tokens and stitch them to the surrounding tokens. The implementation you propose effectively imposes a fixed topology, where one of the new tokens is the head of all the other new tokens and receives all incoming and outgoing dependency edges. I assume the (new) internal dependency edges would have a default label.
Ideally, a token splitter should enable the user to explicitly specify the (internal and external) dependency structure they want. The mechanics involved quickly become intricate though, so...
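
One possible shape for such an explicit interface, sketched as a hypothetical validation helper. The `(head, label)` convention and the name `plan_split` are assumptions for illustration, not spaCy's API:

```python
# The caller supplies one (head, label) pair per new subtoken, where `head`
# is either "outer" (keep the original token's external attachment) or the
# index of another subtoken within the split.

def plan_split(n_subtokens, heads):
    """Validate a user-specified internal dependency structure and return
    the subtoken index that will carry the external attachment."""
    outer = [j for j, (h, _label) in enumerate(heads) if h == "outer"]
    if len(outer) != 1:
        raise ValueError("exactly one subtoken must attach to the outer head")
    for j, (h, _label) in enumerate(heads):
        if h != "outer" and not (0 <= h < n_subtokens):
            raise ValueError(f"subtoken {j} points outside the split")
    return outer[0]

# e.g. split into ["New", "York"] with "York" as the external attachment point:
root = plan_split(2, [(1, "compound"), ("outer", "dobj")])
print(root)  # 1
```

A full implementation would also need to check that the internal edges form a tree rooted at the externally attached subtoken, which hints at the intricacy mentioned above.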

@ines added the enhancement (Feature requests and improvements) and feat / doc (Feature: Doc, Span and Token objects) labels Oct 12, 2018
@grivaz
Contributor Author

grivaz commented Oct 15, 2018

I think the main issue here is the expected use case. Would the user know the complex dependency structure of the split tokens in advance, or would they reparse, or use a simpler structure most of the time? It could also make sense not to touch the dependencies at all, and to assume the document either is not parsed yet or will get reparsed afterwards.

@honnibal
Member

honnibal commented Oct 26, 2018

@grivaz Sorry for the delay getting to this. Definitely appreciate the help.

I tried to implement token splitting during parsing earlier this year, as I wanted the parser to handle languages like Arabic by jointly predicting the tokenization. I wanted an action which would divide the second token of the buffer. This ended up being a huge mess, so I backed out the changes. Introducing a split method that works prior to parsing is much more feasible.

For the parse tree, I think it would make sense to take array-valued arguments for the heads and dep labels. The heads array would specify head offsets for the new tokens, which dictates the parse tree shape.

For instance, `split(doc, 4, 3, heads=[0, -1, -1])` would insert two new tokens at position 4. The head of the new region would be the first token, with the second token attached to it, and the third token attached to the second one.

If we had something like `split(doc, 4, 3, heads=[1, 0, -1])`, the head of the new region would be one of the introduced tokens. To make that work, we'll need to repoint the heads that were going into the region so that they point to this new head.
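
A minimal sketch of how such offset arrays might resolve to absolute head indices (the helper name `resolve_heads` is hypothetical, not a proposed spaCy function):

```python
def resolve_heads(start, offsets):
    """Map per-subtoken head offsets to absolute token indices.
    An offset of 0 marks the region head, which keeps the outer attachment."""
    heads = []
    for j, off in enumerate(offsets):
        pos = start + j
        heads.append(pos if off == 0 else pos + off)
    return heads

print(resolve_heads(4, [0, -1, -1]))  # [4, 4, 5]: first token heads the region
print(resolve_heads(4, [1, 0, -1]))   # [5, 5, 5]: second token heads the region
```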

Otherwise, what you suggest is good. The only thing I'd note is:

Double doc.c length if necessary (until big enough for all new tokens)

The capacity is in `doc.max_length`, while the current length is `doc.length`. See `doc.push_back` for reference.
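
A toy sketch of the doubling step from the proposal above, using plain integers to stand in for `doc.max_length` and `doc.length`:

```python
def grow_capacity(max_length, length, n_new):
    """Double the allocation until it can hold the current tokens plus n_new."""
    while max_length < length + n_new:
        max_length *= 2
    return max_length

print(grow_capacity(8, 7, 4))  # 16: one doubling covers 7 + 4 = 11 tokens
```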

@lock

lock bot commented Mar 16, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 16, 2019