
Split one token into several #2838

Closed
grivaz opened this issue Oct 10, 2018 · 5 comments
Labels
enhancement (Feature requests and improvements) · feat / doc (Feature: Doc, Span and Token objects)

Comments

@grivaz
Contributor

grivaz commented Oct 10, 2018

Feature description

We now have a feature to merge several tokens into one, implemented in the retokenizer. We're lacking the reverse feature: splitting one token into several. An API exists in `_retokenize`, but it is not implemented yet.
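
As a rough illustration of the symmetry between the two operations, here is a pure-Python sketch with tokens modeled as plain strings. The function names are illustrative, not spaCy's API, and this ignores dependencies and attributes entirely:

```python
def merge_tokens(tokens, start, end):
    """Collapse tokens[start:end] into a single token (the existing feature)."""
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]

def split_token(tokens, i, parts):
    """Replace tokens[i] with several new tokens (the missing feature)."""
    assert "".join(parts) == tokens[i], "parts must re-spell the original token"
    return tokens[:i] + list(parts) + tokens[i + 1:]

tokens = ["I", "visited", "NewYork"]
print(split_token(tokens, 2, ["New", "York"]))  # ['I', 'visited', 'New', 'York']
```

The hard part, as the comments below discuss, is everything this sketch leaves out: heads, dependency labels, NER tags, and whitespace flags.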

@grivaz
Contributor Author

grivaz commented Oct 10, 2018

Here are some implementation notes and questions.

  • The current API doesn't specify any syntactic dependencies for the new tokens. This raises two issues:
  1. Where should dependencies that pointed to the original token point after the split? Passing the root of the new span as a parameter would let tokens that pointed to the original token point to the root of the newly created span.
  2. There is no way of knowing the dependencies of the newly created tokens (that I can think of). Here too, a root argument would make it possible to assign that root as the head of all other newly created tokens, and to keep the original dependencies for the root token itself.
  • Here is how I would implement the split, given a root argument:
    Double the doc.c allocation if necessary (until it is big enough for all new tokens)
    Move the tokens after the split point to make space for the new tokens
    Write the new tokens into the newly created space
    Get a LexemeC* for each new orth
    Set token.spacy to False for all but the last split token, and to origToken.spacy for the last one
    Apply attrs to each subtoken
    If origToken.iob == 3 (begin), set the first subtoken to 3 (begin) and all other subtokens to 1 (in)
    In all other cases, subtokens inherit iob from origToken
    Adjust all heads by the offset, similar to the merge function
    Heads that pointed to the original token now point to the root
    All non-root subtokens have the root token as head
    The root subtoken inherits its dependencies from origToken
    Set children from heads

Does that sound like it would work? I can implement it, if so.
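
The steps above might be sketched in plain Python roughly as follows. `Tok` and `split_with_root` are hypothetical stand-ins for spaCy's C-level structures (this ignores the doc.c allocation steps), and absolute head indices are used instead of spaCy's head offsets:

```python
from dataclasses import dataclass, replace

@dataclass
class Tok:
    orth: str
    head: int    # absolute index of the syntactic head (self-index = sentence root)
    spacy: bool  # trailing-whitespace flag
    iob: int     # NER tag: 3 = begin, 1 = in, 2 = out

def split_with_root(tokens, i, orths, root):
    """Split tokens[i] into len(orths) subtokens. `root` is the index,
    within the split, of the subtoken that keeps the original attachment."""
    old = tokens[i]
    n_new = len(orths) - 1  # number of extra slots needed
    new = []
    for j, orth in enumerate(orths):
        if j == root:
            if old.head == i:        # original token was the sentence root
                head = i + root
            elif old.head > i:       # outer head sits after the split point
                head = old.head + n_new
            else:
                head = old.head
        else:                        # non-root subtokens attach to the root
            head = i + root
        spacy = old.spacy if j == len(orths) - 1 else False
        iob = old.iob if (j == 0 or old.iob != 3) else 1
        new.append(Tok(orth, head, spacy, iob))
    out = tokens[:i] + new + tokens[i + 1:]
    for k, tok in enumerate(out):
        if i <= k < i + len(orths):
            continue                 # subtokens were set above
        if tok.head == i:            # repoint heads that targeted the old token
            out[k] = replace(tok, head=i + root)
        elif tok.head > i:           # shift heads that pointed past the split
            out[k] = replace(tok, head=tok.head + n_new)
    return out
```

For example, splitting "NewYork" (head "visited", iob begin) into ["New", "York"] with root=0 keeps "New" attached to "visited", attaches "York" to "New", and gives the subtokens iob tags begin and in respectively.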

@moreymat
Contributor

Merging tokens is simpler than splitting them because we can just iron out the internal structure of the span to be merged, i.e., drop internal dependencies and redirect all other incoming and outgoing dependencies to the syntactic head (provided the merged tokens form a proper subtree).
Note that:

  1. The dropped internal dependencies can form different topologies, including but not restricted to the one where all merged tokens are direct dependents of a unique head within the span;
  2. Tokens other than the head can also have dependents on their own, which is fine because they become dependents of the merged token.

The reverse operation (splitting a token) is trickier because we need to create an internal structure between the newly split tokens and stitch them to the surrounding tokens. The implementation you propose effectively imposes a fixed topology, where one of the new tokens is the head of all the other new tokens and receives all incoming and outgoing dependency edges. I assume the (new) internal dependency edges would have a default label.
Ideally, a token splitter should enable the user to explicitly specify the (internal and external) dependency structure they want. The mechanics involved quickly become intricate though, so...
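
One possible shape for such an explicit interface, sketched as a hypothetical validation helper. The `(head, label)` convention and the name `plan_split` are assumptions for illustration, not spaCy's API:

```python
# The caller supplies one (head, label) pair per new subtoken, where `head`
# is either "outer" (keep the original token's external attachment) or the
# index of another subtoken within the split.

def plan_split(n_subtokens, heads):
    """Validate a user-specified internal dependency structure and return
    the subtoken index that will carry the external attachment."""
    outer = [j for j, (h, _label) in enumerate(heads) if h == "outer"]
    if len(outer) != 1:
        raise ValueError("exactly one subtoken must attach to the outer head")
    for j, (h, _label) in enumerate(heads):
        if h != "outer" and not (0 <= h < n_subtokens):
            raise ValueError(f"subtoken {j} points outside the split")
    return outer[0]

# e.g. split into ["New", "York"] with "York" as the external attachment point:
root = plan_split(2, [(1, "compound"), ("outer", "dobj")])
print(root)  # 1
```

A full implementation would also need to check that the internal edges form a tree rooted at the externally attached subtoken, which hints at the intricacy mentioned above.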

@ines added the enhancement (Feature requests and improvements) and feat / doc (Feature: Doc, Span and Token objects) labels Oct 12, 2018
@grivaz
Contributor Author

grivaz commented Oct 15, 2018

I think the main issue here is the expected use case. Would the user know the complex dependency structure of the split tokens in advance, or would they reparse, or use a simpler structure most of the time? It could also make sense not to touch the dependencies at all, and to assume the document either is not parsed yet or will get reparsed afterwards.

@honnibal
Member

honnibal commented Oct 26, 2018

@grivaz Sorry for the delay getting to this. Definitely appreciate the help.

I tried to implement token splitting during parsing earlier this year, as I wanted the parser to handle languages like Arabic by jointly predicting the tokenization. I wanted an action which would divide the second token of the buffer. This ended up being a huge mess, so I backed out the changes. Introducing a split method that works prior to parsing is much more feasible.

For the parse tree, I think it would make sense to take array-valued arguments for the heads and dep labels. The heads array would specify head offsets for the new tokens, which dictates the parse tree shape.

For instance, `split(doc, 4, 3, heads=[0, -1, -1])` would insert two new tokens at position 4. The head of the new region would be the first token, with the second token attached to it, and the third token attached to the second one.

If we had something like `split(doc, 4, 3, heads=[1, 0, -1])`, the head of the new region would be one of the introduced tokens. To make that work, we'll need to repoint the heads that were going into the region so that they point to this new head.
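
A minimal sketch of how such offset arrays might resolve to absolute head indices (the helper name `resolve_heads` is hypothetical, not a proposed spaCy function):

```python
def resolve_heads(start, offsets):
    """Map per-subtoken head offsets to absolute token indices.
    An offset of 0 marks the region head, which keeps the outer attachment."""
    heads = []
    for j, off in enumerate(offsets):
        pos = start + j
        heads.append(pos if off == 0 else pos + off)
    return heads

print(resolve_heads(4, [0, -1, -1]))  # [4, 4, 5]: first token heads the region
print(resolve_heads(4, [1, 0, -1]))   # [5, 5, 5]: second token heads the region
```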

Otherwise, what you suggest is good. The only thing I'd note is:

Double doc.c length if necessary (until big enough for all new tokens)

The capacity is in `doc.max_length`, while the current length is `doc.length`. See `doc.push_back` for reference.
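
A toy sketch of the doubling step from the proposal above, using plain integers to stand in for `doc.max_length` and `doc.length`:

```python
def grow_capacity(max_length, length, n_new):
    """Double the allocation until it can hold the current tokens plus n_new."""
    while max_length < length + n_new:
        max_length *= 2
    return max_length

print(grow_capacity(8, 7, 4))  # 16: one doubling covers 7 + 4 = 11 tokens
```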

@lock

lock bot commented Mar 16, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 16, 2019