
Improvements on setting attributes in merge_noun_chunks function #4107

Closed
alaponin opened this issue Aug 12, 2019 · 3 comments
Labels
enhancement (Feature requests and improvements), feat / pipeline (Feature: Processing pipeline and components)

Comments

@alaponin

Feature description

It would be nice if the lemma were also set properly when noun chunks are merged. Currently, the lemma of the merged token is the lemma of the first word in the noun chunk, which I don't think is the desired behavior in most cases.
Additionally, it would be nice to have more flexibility in setting the attributes of the merged token. For example, I would like to keep the entity type of the root token as the entity type of the whole merged token.
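
For reference, a minimal sketch of the current behaviour (assuming spaCy v2 with the en_core_web_sm model installed; the model choice is just an example):

import spacy

nlp = spacy.load("en_core_web_sm")
merge_nps = nlp.create_pipe("merge_noun_chunks")
nlp.add_pipe(merge_nps)

doc = nlp("The quick brown fox jumped over the lazy dog.")
# Each noun chunk is now a single token; as described above, its lemma ends
# up being the lemma of the chunk's first word rather than the root's.
print([(token.text, token.lemma_, token.ent_type_) for token in doc])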

ines added the enhancement and feat / pipeline labels Aug 12, 2019
@ines
Member

ines commented Aug 12, 2019

I think those both sound like reasonable defaults, so we should probably consider just adding them to the built-in function 🙂 Feel free to submit a PR btw.

We probably want to avoid introducing settings to the built-in component functions – it adds too much complexity for what they are (small wrappers around doc.retokenize), and once we add settings, we then also need to add serialization methods to preserve them.

Btw, in case others come across this issue later: the merge_noun_chunks function itself is tiny, so if you ever need fully custom settings that wouldn't make good defaults, I'd recommend just copying it and writing your own:

def merge_noun_chunks(doc):
    """Merge noun chunks into a single token.
    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged noun chunks.
    DOCS: https://spacy.io/api/pipeline-functions#merge_noun_chunks
    """
    if not doc.is_parsed:
        return doc
    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)
    return doc
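
For example, a copied version that also sets the lemma and keeps the entity type of the chunk's root (a sketch of the behaviour requested above, not spaCy's built-in implementation) could look roughly like this:

def merge_noun_chunks_custom(doc):
    """Merge noun chunks into single tokens, also setting the lemma and
    keeping the entity type of the chunk's root token.
    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged noun chunks.
    """
    if not doc.is_parsed:
        return doc
    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            attrs = {
                "tag": np.root.tag,
                "dep": np.root.dep,
                "lemma": np.root.lemma_,       # or np.lemma_ for the whole span
                "ent_type": np.root.ent_type,  # keep the root's entity type
            }
            retokenizer.merge(np, attrs=attrs)
    return doc

You can then add it anywhere after the parser with nlp.add_pipe(merge_noun_chunks_custom).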

@ines
Member

ines commented Sep 8, 2019

Resolved by #4219!

@ines ines closed this as completed Sep 8, 2019
@lock

lock bot commented Oct 8, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Oct 8, 2019