[de] no spelling suggestion for 'Postleidzahl' #725

Closed
janschreiber opened this issue Jun 21, 2017 · 22 comments

Comments

@janschreiber
Contributor

The German spell checker does not come up with any suggestions for 'Postleidzahl'. Looks like a bug IMO.

@danielnaber
Member

The issue here is that Post, Leid, and Zahl are correct on their own, thus the algorithm doesn't come up with suggestions. Plus "Leit" is one of those compound parts that only appear in compounds and not on their own. Plus "Postleitzahl" as a whole word is not part of our binary dictionary, which builds on a hunspell export, and hunspell also recognizes Postleitzahl only by accepting it as a compound. Possible solutions:

  • If no suggestions are found, use hunspell for suggestions again. Slow, but probably okay as it doesn't happen too often. Would help for this case but probably not for others where we already have (bad) suggestions.
  • Extend the binary dictionary with more compound words, e.g. with Jan's list.
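
A minimal sketch of the first option, assuming both suggestion sources can be wrapped as plain functions from a word to a list of suggestions. The class name `FallbackSuggester` and the parameters `primary` and `fallback` are hypothetical and not part of LanguageTool, morfologik, or hunspell:

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: ask the fast source first, and only call the slow
// source (e.g. a hunspell wrapper) when the fast one returns nothing.
final class FallbackSuggester {

    static List<String> suggest(String word,
                                Function<String, List<String>> primary,
                                Function<String, List<String>> fallback) {
        List<String> fast = primary.apply(word);
        if (!fast.isEmpty()) {
            return fast; // (possibly bad) suggestions exist, slow path is skipped
        }
        return fallback.apply(word); // slow path, expected to be rare
    }
}
```

As noted in the bullet itself, this only helps when the fast source returns nothing at all; it does not improve cases where bad suggestions already exist.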

@janschreiber
Contributor Author

janschreiber commented Jun 23, 2017

Thanks for your comment, @danielnaber. My two cents:

Both approaches look viable to me.

(1) I tested Hunspell's suggestions (in LibreOffice 5.1.6.2, latest frami dictionary) on three recent examples for which we don't have suggestions right now:

  • schlaganfal (BTW: again, it seems weird that there is no suggestion here)
  • Postleidzahl
  • analpherbet

(All three are examples that show IMO that at the moment, we're deserting the users that need our help the most if we can't come up with suggestions. That's probably why we are receiving so many user suggestions of disturbingly poor quality.)

Hunspell suggests:

  • Schlaganfallrisiko (good word, but a bit weird, misses the most natural suggestion)
  • Postleitzahl (correct)
  • Postleihzahl (a bit meh but acceptable)
  • Plastidenzahl (apparently an actual word, so fine)
  • Analphabetenrate (far off, but a good word)
  • Analphabeten (expected suggestion)
  • heranarbeiteten (okay, why not?)
  • Analphabet (fine)

Based on this small random sample, I daresay that the user experience for users with poor spelling abilities would be greatly enhanced over what we have now by implementing your first idea.

(2) As a long-term user and huge fan of Hunspell (ever since it was first included in OpenOffice), I know its weaknesses. Basically, Hunspell accepts and suggests too many odd compounds, as most other spell checkers that support compounding do. Examples include (iirc) Mistreiter, Parklatz, Bestelllungen, Absatzahlen, Nationalsoziallisten, Weltrum, Tonbäder, Schutzschilder, Sitzbanken, Linienbusen, Sozialfond. Those are relatively subtle misspellings that are hard to spot for humans, that's why I consider it dangerous to include them in the suggestions. Blacklisting those would be a huge effort.

If we merge my list (or any other large, orthographically clean list of words) into the binary dictionary and prioritize them over the compound words generated on the fly (i. e., show them first in the context menu) and add the automatically generated suggestions in case a compound is not in the binary dictionary, I'm pretty confident that would do the job way better than any existing solution.
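
The ordering described here could look roughly like the sketch below. It assumes the two suggestion lists are already available; `SuggestionRanking` and `merge` are made-up names for illustration only:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: words found in the clean word list come first,
// compounds generated on the fly are appended afterwards, without duplicates.
final class SuggestionRanking {

    static List<String> merge(List<String> fromDictionary, List<String> generatedCompounds) {
        // LinkedHashSet keeps insertion order and drops repeated entries.
        Set<String> ordered = new LinkedHashSet<>(fromDictionary);
        ordered.addAll(generatedCompounds);
        return new ArrayList<>(ordered);
    }
}
```

For example, merging the dictionary hit "Postleitzahl" with the generated list "Postleihzahl", "Postleitzahl" puts "Postleitzahl" first and "Postleihzahl" second.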

@janschreiber
Contributor Author

Another odd thing I found today: The checker accepts "Ratscafé" and rejects "Ratscafe", which is fine, but the correct word is not in the list of suggestions.

@janschreiber
Contributor Author

Even worse: Suggestions don't work for 'Jezt' with an uppercase J. Works for lowercase j.

@janschreiber
Contributor Author

Another case I found today: Umzugsvorber_ie_tungen. No suggestions.

@danielnaber
Member

I have a local fix for "Postleidzahl". Unfortunately there's a side effect of adding more and more words to the dictionary: morfologik's Speller.findReplacements() will only work on words that are "misspelled", i.e. not in its dictionary. Thus, we don't get any suggestions for e.g. Henrik because it's now in the morfologik dictionary, but not in the hunspell one. I guess we need to modify morfologik to get this working for our (rather special) use case.

@janschreiber
Contributor Author

I have a list (3 MB) of the words in my wordlist that are not accepted by Hunspell. If I understand the problem correctly, it would help if we either remove those before merging the list into the binary suggestions dictionary or add them to spelling.txt.
All those words were programmatically checked with Duden Korrektor or MS Word, but most of them not manually. Perhaps it would be the cleanest solution to remove them.
In any case, it would be quite irritating for the users if the spell checker suggests words that it then considers misspelled, or marks a word as misspelled and then suggests that exact same word.
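
A rough sketch of the "remove them before merging" variant, assuming the spell checker's verdict is available as a simple predicate. `WordListFilter` and `acceptedByChecker` are hypothetical names, not an existing API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical pre-processing step: only words the spell checker itself
// accepts are kept, so a suggestion can never be flagged as misspelled.
final class WordListFilter {

    static List<String> keepAccepted(List<String> words, Predicate<String> acceptedByChecker) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (acceptedByChecker.test(word)) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

With a checker that accepts "Hendrik" and "Flocke" but not "Henrik", the list "Henrik", "Hendrik", "Flocke" is reduced to "Hendrik", "Flocke".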

@danielnaber
Member

If I understand the problem correctly, it would help if we either remove those before merging the list into the binary suggestions dictionary

That's correct, I tried that locally and it helps. There are still strange issues left, e.g. "Henrik" now has better suggestions (like "Hendrik"), but "Flucke" doesn't have the obvious "Flocke" suggestion. This is tricky to debug, it's probably related to "flocke" (lowercase) being one of the suggestions...

@janschreiber
Contributor Author

This is tricky to debug, it's probably related to "flocke" (lowercase) being one of the suggestions...

If case sensitivity is the problem, it might help if you merge the file uppercase_candidates.txt into the binary dictionary. It is a (very incomplete) list of words that can be both uppercase and lowercase. The uppercase variants are removed from german.dic because Aspell doesn't work properly if they are present.

@janschreiber
Contributor Author

A few more real-life examples from the users' suggestions (the misspelling is followed by the intended suggestion):

  1. Gutschaine → Gutscheine (works already)
  2. Komunkationsinstrumente → Kommunikationsinstrumente
  3. mogligkeit → Möglichkeit
  4. Gelbensäcke → gelben Säcke
  5. WIFI → Wi-Fi
  6. Aussenvisualisierung → Außenvisualisierung

Except for the first example, the current status is that LanguageTool makes no or misleading suggestions.
Aspell with my word list gets the first three right, because 'Außenvisualisierung' is not in the dictionary yet. 'Fi' and 'Wi' will not be added.
Hunspell gets the first and the last one right.
Google Docs doesn't consider 'Aussenvisualisierung' an error and suggests 'moglichkeit' in the third example, but gets the others right.
The Duden online checker only has the proper suggestion for 'Aussenvisualisierung', no suggestions at all for the other cases. It just says "Check the spelling," which is quite unhelpful for many users.

I think it might help if we tweak the calculation of the edit distance in some cases. The idea is that some characters (or character groups) are so similar that it should be counted as less than one edit to go from one to the other, maybe 0.5.

In the third example, I see three ways to get 'mogligkeit' closer to 'Möglichkeit' in terms of Levenshtein distance:

  • If the correction converts the first letter to uppercase, this might even be considered a distance of zero, at least for German. (It's the same letter in some sense, after all.)
  • Converting 'g' to 'ch' could be considered a single edit instead of two. Possibly the same for 'er' and 'a', at least at the word end: 'Bölla' is very likely an attempt to write 'Böller'. Certainly the step from 'ss' to 'ß' and vice versa should be counted as one edit, maybe even less than one. The same for 'oe' and 'ö' etc. Another candidate: 'x' and 'chs' (Fux/Fuchs).
  • The step from a vowel to its umlaut counterpart is less than one full edit IMO.

Applying that basic idea to 'mogligkeit', I end up with a distance of, say, 1.5 instead of 4. Applying the same logic, 'Molligkeit' is still closer and should be higher in the list of suggestions unless statistics suggests otherwise. But at least in this case, even a weak speller should be able to figure out which one of the two is the intended word.
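
To make the arithmetic concrete, here is a small sketch of such a weighted edit distance. It is not how morfologik or LanguageTool compute distances; the class name, the weights (0 for a case change of the first letter, 0.5 for vowel/umlaut, 1 for a character group such as g/ch) and the group list are assumptions taken from the rough numbers above. Non-ASCII letters are written as Unicode escapes.

```java
// Hypothetical weighted Levenshtein distance with cheap German-specific edits.
final class WeightedEditDistance {
    private static final double UMLAUT_COST = 0.5; // assumed weight
    private static final double GROUP_COST = 1.0;  // assumed weight
    // Plain vowel followed by its umlaut, lowercase and uppercase.
    private static final String UMLAUT_PAIRS = "a\u00e4o\u00f6u\u00fcA\u00c4O\u00d6U\u00dc";
    // Groups counted as one edit: g/ch, ss/sharp s, oe/o-umlaut, x/chs.
    private static final String[][] GROUPS = {
        {"g", "ch"}, {"ss", "\u00df"}, {"oe", "\u00f6"}, {"x", "chs"}
    };

    static double distance(String a, String b) {
        // d[i][j] = cheapest way to turn the first i chars of a into the first j chars of b
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                double best = Math.min(d[i - 1][j], d[i][j - 1]) + 1.0; // delete or insert
                best = Math.min(best, d[i - 1][j - 1]
                        + substitutionCost(a.charAt(i - 1), b.charAt(j - 1), i == 1 && j == 1));
                for (String[] g : GROUPS) { // group replacements, both directions
                    best = Math.min(best, groupCost(d, a, i, b, j, g[0], g[1]));
                    best = Math.min(best, groupCost(d, a, i, b, j, g[1], g[0]));
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    private static double substitutionCost(char x, char y, boolean firstLetter) {
        if (x == y) return 0.0;
        // Changing only the case of the first letter is treated as free.
        if (firstLetter && Character.toLowerCase(x) == Character.toLowerCase(y)) return 0.0;
        if (isUmlautPair(x, y)) return UMLAUT_COST;
        return 1.0;
    }

    private static boolean isUmlautPair(char x, char y) {
        for (int k = 0; k < UMLAUT_PAIRS.length(); k += 2) {
            char plain = UMLAUT_PAIRS.charAt(k);
            char umlaut = UMLAUT_PAIRS.charAt(k + 1);
            if ((x == plain && y == umlaut) || (x == umlaut && y == plain)) return true;
        }
        return false;
    }

    // Cost of ending a[0..i) with 'from' and b[0..j) with 'to', or "infinite" if they do not match.
    private static double groupCost(double[][] d, String a, int i, String b, int j,
                                    String from, String to) {
        int pi = i - from.length();
        int pj = j - to.length();
        if (pi < 0 || pj < 0) return Double.MAX_VALUE;
        if (!a.startsWith(from, pi) || !b.startsWith(to, pj)) return Double.MAX_VALUE;
        return d[pi][pj] + GROUP_COST;
    }
}
```

With these assumed weights, `distance("mogligkeit", "M\u00f6glichkeit")` is 1.5 (case change 0, o to ö 0.5, g to ch 1), while `distance("mogligkeit", "Molligkeit")` is 1.0, so 'Molligkeit' still ranks first, as described above.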

@danielnaber
Member

In the third example, I see three ways to get 'mogligkeit' closer to 'Möglichkeit' in terms of Levenshtein distance:

This already exists, but as with almost everything related to spell checking in LT, the situation is a bit complex:

@janschreiber
Contributor Author

On a related note, it might help if we relax the maximum edit distance for long misspelled words, but only when searching in the morfologik dictionary, otherwise we will probably suggest too many nonsensical words, and it would be computationally costly/slow.
A distance of 2 is often very low.
Something like this:

// Only when searching for suggestions in the morfologik dictionary:
int maxEdDist;
if (wrongWord.length() > 8) {
    maxEdDist = 4; // long words: allow more edits
} else if (wrongWord.length() == 2) {
    maxEdDist = 1; // avoid suggesting 'an' for 'cu' etc.
} else {
    maxEdDist = MAX_EDIT_DISTANCE; // existing default limit
}

danielnaber added a commit that referenced this issue Jul 12, 2017
@danielnaber
Member

Maybe @jaumeortola can help us with the case issue? schlaganfal has two typos, the correct word is Schlaganfall. Still, we don't get a suggestion. Can we ignore upper/lowercase for suggestions? With fsa.dict.speller.equivalent-chars=s S,S s I didn't see an improvement. But even without that, schlaganfal should have a distance of 2 and should thus be suggested. Jaume, if you think you can help but need a minimal test case, I could create one.

danielnaber added a commit that referenced this issue Jul 13, 2017
…ds to the morfologik dictionary (#725, de-DE only for now due to the size of the dictionary)
@danielnaber
Member

"Postleidzahl" should provide a good suggestion now, as Jan's list is now used (minus the words hunspell wouldn't accept, as discussed above).

@janschreiber
Contributor Author

"Postleidzahl" should provide a good suggestion now

It does! Based on my tests so far, my impression is that the improvement is huge! In most of the cases I get exactly the correction that I would expect from a human proofreader as the first or second suggestion in the list. I'm very impressed!

@danielnaber
Member

Feel free to add misspelled words to languagetool-language-modules/de/src/test/resources/suggestions.txt in the format word => - this file is used by a test case, but one that doesn't run automatically. I'll run it from time to time to see how suggestions improve/change.

@f-knorr
Contributor

f-knorr commented Jul 15, 2017

Weird behavior: I have entered "In aller stile." and "stile" is correctly identified as a typo, but LT does not suggest the uppercase variant "Stile". (Remark: I have modified an existing rule that suggests "Stille" for "Stile")

@jaumeortola
Member

schlaganfal > Schlaganfall is now fixed. I am a bit confused. How was it done? With the new speller dictionary? So "Schlaganfall" was missing in the old one?

@janschreiber
Contributor Author

So "Schlaganfall" was missing in the old one?

Yes. The old dic was a Hunspell export, and Hunspell doesn't need simple compounds like this, because it generates them on-the-fly.
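
As a toy illustration of "generated on-the-fly": a checker can accept a word that is not stored anywhere as long as it can be split into known parts. Hunspell's real compounding is controlled by rules in its affix file, which this sketch ignores completely; `CompoundCheck` and the part list are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

// Toy compound check: a word passes if it is a concatenation of known parts.
// Case is ignored and no linking elements or compound rules are modelled.
final class CompoundCheck {

    static boolean isBuiltFromParts(String word, Set<String> parts) {
        String w = word.toLowerCase(Locale.GERMAN);
        if (w.isEmpty()) {
            return false;
        }
        // canSplit[i] is true if the first i characters can be written as known parts.
        boolean[] canSplit = new boolean[w.length() + 1];
        canSplit[0] = true;
        for (int end = 1; end <= w.length(); end++) {
            for (int start = 0; start < end && !canSplit[end]; start++) {
                if (canSplit[start] && parts.contains(w.substring(start, end))) {
                    canSplit[end] = true;
                }
            }
        }
        return canSplit[w.length()];
    }
}
```

With the parts "schlag", "anfall" and "risiko", both "Schlaganfall" and "Schlaganfallrisiko" pass although neither is listed as a whole word, while "schlaganfal" does not. A dictionary exported from such a checker therefore contains the parts but not the compound, so a suggestion search over it has no "Schlaganfall" entry to find.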

@danielnaber
Member

Weird behavior: I have entered "In aller stile." and "stile" is correctly identified as a typo, but LT does not suggest the uppercase variant "Stile".

I think this is caused by an if condition in morfologik. @jaumeortola any opinion on whether this could be changed, maybe optionally?

@jaumeortola
Member

jaumeortola commented Jul 15, 2017

Yes. That is the reason. You have the replacement i > y, and once "style" is found, the search is stopped. I think we should change this condition and always accept suggestions with a case change. So we will suggest Stile and style for stile.
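
The proposed behaviour could be sketched like this. It is not morfologik's actual code; `CaseAwareSuggester`, the in-memory dictionary and the replacement map are stand-ins chosen for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical suggester: the case-changed variant is always tried, and the
// search does not stop once a replacement-pair candidate has been found.
final class CaseAwareSuggester {

    static List<String> suggest(String word, Set<String> dictionary,
                                Map<String, String> replacements) {
        Set<String> result = new LinkedHashSet<>();
        if (!word.isEmpty()) {
            // Always accept a suggestion that only changes the case of the first letter.
            String capitalized = Character.toUpperCase(word.charAt(0)) + word.substring(1);
            if (!capitalized.equals(word) && dictionary.contains(capitalized)) {
                result.add(capitalized);
            }
        }
        // Then try every replacement pair (e.g. i > y) at every position.
        for (Map.Entry<String, String> pair : replacements.entrySet()) {
            String from = pair.getKey();
            if (from.isEmpty()) {
                continue; // an empty pattern would match everywhere
            }
            for (int idx = word.indexOf(from); idx >= 0; idx = word.indexOf(from, idx + 1)) {
                String candidate = word.substring(0, idx) + pair.getValue()
                        + word.substring(idx + from.length());
                if (dictionary.contains(candidate)) {
                    result.add(candidate);
                }
            }
        }
        return new ArrayList<>(result);
    }
}
```

For "stile" with a dictionary containing "Stile" and "style" and the single replacement i > y, this returns "Stile" followed by "style".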

@janschreiber
Contributor Author

Closing this for now.
