
Improvements on setting attributes in merge_noun_chunks function #4107

Closed
alaponin opened this issue Aug 12, 2019 · 3 comments
Labels
enhancement (Feature requests and improvements), feat / pipeline (Feature: Processing pipeline and components)

Comments

@alaponin

Feature description

It would be nice if the lemma were also set properly when noun chunks are merged. Currently, the lemma of the merged token is the lemma of the first word in the noun chunk, which I don't think is the desired behavior in most cases.
Additionally, it would be nice to have more flexibility in setting the attributes of the merged token. For example, I would like to keep the entity type of the root token as the entity type of the whole merged token.
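
For reference, a minimal sketch of the current behaviour (assuming spaCy v2 with the en_core_web_sm model installed; the model choice is just an example):

import spacy

nlp = spacy.load("en_core_web_sm")
merge_nps = nlp.create_pipe("merge_noun_chunks")
nlp.add_pipe(merge_nps)

doc = nlp("The quick brown fox jumped over the lazy dog.")
# Each noun chunk is now a single token; as described above, its lemma ends
# up being the lemma of the chunk's first word rather than the root's.
print([(token.text, token.lemma_, token.ent_type_) for token in doc])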

ines added the enhancement and feat / pipeline labels Aug 12, 2019
@ines
Member

ines commented Aug 12, 2019

I think those both sound like reasonable defaults, so we should probably consider just adding them to the built-in function 🙂 Feel free to submit a PR btw.

We probably want to avoid introducing settings to the built-in component functions – it adds too much complexity for what they are (small wrappers around doc.retokenize), and once we add settings, we then also need to add serialization methods to preserve them.

Btw, in case others come across this issue later: the merge_noun_chunks function itself is tiny, so if you ever need fully custom settings that wouldn't make good defaults, I'd recommend just copying it and writing your own:

def merge_noun_chunks(doc):
    """Merge noun chunks into a single token.
    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged noun chunks.
    DOCS: https://spacy.io/api/pipeline-functions#merge_noun_chunks
    """
    if not doc.is_parsed:
        return doc
    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)
    return doc
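
For example, a copied version that also sets the lemma and keeps the entity type of the chunk's root (a sketch of the behaviour requested above, not spaCy's built-in implementation) could look roughly like this:

def merge_noun_chunks_custom(doc):
    """Merge noun chunks into single tokens, also setting the lemma and
    keeping the entity type of the chunk's root token.
    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged noun chunks.
    """
    if not doc.is_parsed:
        return doc
    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            attrs = {
                "tag": np.root.tag,
                "dep": np.root.dep,
                "lemma": np.root.lemma_,       # or np.lemma_ for the whole span
                "ent_type": np.root.ent_type,  # keep the root's entity type
            }
            retokenizer.merge(np, attrs=attrs)
    return doc

You can then add it anywhere after the parser with nlp.add_pipe(merge_noun_chunks_custom).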

@ines
Member

ines commented Sep 8, 2019

Resolved by #4219!

@ines ines closed this as completed Sep 8, 2019
@lock

lock bot commented Oct 8, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Oct 8, 2019