Weird Greek Lemmas #4241

Closed
polm opened this issue Sep 5, 2019 · 6 comments
Labels
feat / lemmatizer Feature: Rule-based and lookup lemmatization help wanted (easy) Contributions welcome! (also suited for spaCy beginners) lang / el Greek language data and models

Comments

@polm
Contributor

polm commented Sep 5, 2019

I cannot read Greek, but looking at the nouns in the lemma_index.json file for Greek, these are the first several entries:

"(ιρλανδικά)", "(σκωτικά)", "(σοράνι)", "-αλγία", "-βατώ", "-βατῶ", "-ούλα", "-πληξία", "-ώνυμο", "sofa", "table", "άβακας", "άβατο", "άβατον"

I'm pretty sure parentheses don't belong there, and the things that begin with hyphens and "table" and "sofa" seem out of place.

Maybe this is due to a bug in the Wiktionary parsing script mentioned by @giannisdaras in #2558?

If someone who speaks Greek could check and clarify what, if anything, should be removed, that would be great.

@svlandeg svlandeg added feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / el Greek language data and models help wanted (easy) Contributions welcome! (also suited for spaCy beginners) labels Sep 5, 2019
@GiorgioPorgio

Greek here.

Parentheses don't belong there.
Entries beginning with hyphens are suffixes, so they don't belong there either.
'sofa', 'table' also don't belong there and I also spotted 'n-διάστατος' which is not a Greek lemma.

I can take care of this if you want :)
I can have a closer look at lemma_index.json to find anything else that doesn't belong. Is it more helpful if I edit the file myself and push to the branch, or just let you know which entries should be removed? Or some other approach (e.g. checking the parsing script)?
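
The checks described in this comment (parentheses, leading hyphens, entries with no Greek letters, and mixed-script entries such as 'n-διάστατος') could be sketched roughly as follows. This is only an illustration, not the script that was used: the helper names `is_greek_letter` and `suspicious_reason` are hypothetical, and the code works on a plain list of strings rather than assuming anything about the layout of lemma_index.json.

```python
import unicodedata


def is_greek_letter(ch):
    """True if the Unicode name of the character starts with 'GREEK'."""
    return unicodedata.name(ch, "").startswith("GREEK")


def suspicious_reason(lemma):
    """Return a short reason why an entry looks out of place, or None."""
    if "(" in lemma or ")" in lemma:
        return "parentheses"
    if lemma.startswith("-"):
        return "suffix"
    # Only alphabetic characters are checked; hyphens and combining accents are ignored.
    letters = [ch for ch in lemma if ch.isalpha()]
    if letters and not any(is_greek_letter(ch) for ch in letters):
        return "no Greek letters"
    if any(not is_greek_letter(ch) for ch in letters):
        return "mixed scripts"
    return None


# A few entries quoted in this thread (hypothetical helper names, real examples).
entries = ["(ιρλανδικά)", "-αλγία", "sofa", "table", "n-διάστατος", "άβακας", "άβατο"]
for entry in entries:
    reason = suspicious_reason(entry)
    if reason is not None:
        print(f"{entry}: {reason}")
```

A check like this only produces candidates; whether a flagged entry should actually be removed still needs review by someone who reads Greek.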

@ines
Member

ines commented Oct 22, 2019

@GiorgioPorgio Thanks! You could submit a PR to the spacy-lookups-data repo here: https://github.com/explosion/spacy-lookups-data/

@GiorgioPorgio

@ines Happy to help! Ready to submit my PR.
Also, 8 lemmas in total contained the English (Latin) 'o' character instead of the Greek one. Not sure why, but I changed these 'o's to Greek to be consistent, since the two are different character codes.
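
To illustrate why the two 'o's matter: Latin "o" (U+006F) and Greek omicron "ο" (U+03BF) look identical but compare as different strings, so a lookup for the correctly typed word would miss the entry. The sketch below uses a made-up example word and a hypothetical helper name, `replace_latin_o`; it is not the actual change from the PR.

```python
LATIN_SMALL_O = "\u006f"        # "o", LATIN SMALL LETTER O
GREEK_SMALL_OMICRON = "\u03bf"  # "ο", GREEK SMALL LETTER OMICRON


def replace_latin_o(lemma):
    """Swap every Latin 'o' for a Greek omicron (hypothetical helper)."""
    return lemma.replace(LATIN_SMALL_O, GREEK_SMALL_OMICRON)


# Made-up example: "άβατο" typed with a Latin "o" at the end.
bad = "άβατ" + LATIN_SMALL_O
good = "άβατ" + GREEK_SMALL_OMICRON

print(bad == good)                   # False: the strings only look the same
print(replace_latin_o(bad) == good)  # True
print([hex(ord(ch)) for ch in (LATIN_SMALL_O, GREEK_SMALL_OMICRON)])  # ['0x6f', '0x3bf']
```

A blanket replacement like this is only safe on entries that are otherwise entirely Greek; applied to genuinely Latin-script strings it would corrupt them.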

One Q though:
Should I make the .github/contributors/giorgioporgio.md? I don't mind of course, but asking because it's such a tiny tiny contribution :)

@ines
Member

ines commented Oct 23, 2019

@GiorgioPorgio The contributor agreement is up to you – for a small change like this, it's okay to leave it out for now.

@ines
Member

ines commented Oct 25, 2019

See explosion/spacy-lookups-data#3 (note that the changes will be reflected in Greek with the next release of spacy-lookups-data, but the models still include the previous data until we retrain them).

@ines ines closed this as completed Oct 25, 2019
@lock

lock bot commented Nov 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Nov 24, 2019