Add Phrase Suggester #2709

Closed
s1monw opened this issue Feb 28, 2013 · 38 comments

Comments

@s1monw
Contributor

s1monw commented Feb 28, 2013

Phrase Suggester

The term suggester provides a very convenient API to access word alternatives on a per-token
basis within a certain string distance. The API allows accessing each token in the stream
individually, while suggestion selection is left to the API consumer. Yet, ranked and
pre-selected suggestions are often required in order to present them to the end user.
Inside ElasticSearch we have the ability to quickly access far more statistics and information
to make a better decision about which token alternative to pick, or whether to pick an alternative at all.

This phrase suggester adds some logic on top of the term suggester to select entire
corrected phrases instead of individual tokens, weighted based on an n-gram language model. In practice it
will be able to make better decisions about which tokens to pick based on co-occurrence and frequencies.
The current implementation is kept quite general and leaves room for future improvements.

API Example

The phrase request is defined alongside the query part in the JSON request:

curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 1,
        "real_word_error_likelihood" : 0.95,
        "max_errors" : 0.5,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        } ]
      }
    }
  }
}'

The response contains suggestions sorted by the most likely spell correction first. In this case we get the expected correction
xorr the god jewel first, while the second correction is less conservative and corrects only one of the errors. Note that the request
is executed with max_errors set to 0.5, so 50% of the terms may contain misspellings (see the parameter descriptions below).

{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2938,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "simple_phrase" : [ {
      "text" : "Xor the Got-Jewel",
      "offset" : 0,
      "length" : 17,
      "options" : [ {
        "text" : "xorr the god jewel",
        "score" : 0.17877324
      }, {
        "text" : "xor the god jewel",
        "score" : 0.14231323
      } ]
    } ]
  }
}

Phrase suggest API

Basic parameters

  • field - the name of the field used to do n-gram lookups for the language model; the suggester will use this field to gain statistics to score corrections.
  • gram_size - sets the maximum size of the n-grams (shingles) in the field. If the field doesn't contain n-grams (shingles) this should be omitted or set to 1.
  • real_word_error_likelihood - the likelihood of a term being misspelled even if the term exists in the dictionary. The default is 0.95, corresponding to 5% of real words being misspelled.
  • confidence - The confidence level defines a factor applied to the input phrase's score which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance, a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0 the top N candidates are returned. The default is 1.0.
  • max_errors - the maximum percentage of the terms that can be considered misspellings in order to form a correction. This parameter accepts a float value in the range [0..1) as a fraction of the actual query terms, or a number >= 1 as an absolute number of query terms. The default is 1.0, which means only corrections with at most one misspelled term are returned.
  • separator - the separator that is used to separate terms in the bigram field. If not set the whitespace character is used as a separator.
  • size - the number of candidates that are generated for each individual query term. Low numbers like 3 or 5 typically produce good results. Raising this can bring up terms with higher edit distances. The default is 5.
  • analyzer - Sets the analyzer used to analyze the suggest text. Defaults to the search analyzer of the suggest field passed via field.
  • shard_size - Sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase, only the top N suggestions are returned based on the size option. Defaults to 5.
  • text - Sets the text / query to provide suggestions for.

Smoothing Models

The phrase suggester supports multiple smoothing models to balance weight between infrequent grams (grams/shingles that do not exist in the index) and frequent grams (grams that appear at least once in the index); a configuration sketch follows the list below.

  • laplace - the default model that uses an additive smoothing model where a constant (typically 1.0 or smaller) is added to all counts to balance weights. The default alpha is 0.5.
  • stupid_backoff - a simple backoff model that backs off to lower order n-gram models if the higher order count is 0 and discounts the lower order n-gram model by a constant factor. The default discount is 0.4.
  • linear_interpolation - a smoothing model that takes the weighted mean of the unigrams, bigrams and trigrams based on user supplied weights (lambdas). Linear Interpolation doesn't have any default values. All parameters (trigram_lambda, bigram_lambda, unigram_lambda) must be supplied.
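
For illustration, here is a minimal sketch of how a smoothing model could be configured, assuming the smoothing object layout proposed later in this thread (see #2735); the field name trigram and the lambda values are purely illustrative:

curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "field" : "trigram",
        "gram_size" : 3,
        "smoothing" : {
          "linear_interpolation" : {
            "trigram_lambda" : 0.65,
            "bigram_lambda" : 0.25,
            "unigram_lambda" : 0.1
          }
        }
      }
    }
  }
}'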

Candidate Generators

The phrase suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a term suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms to form suggestion candidates.
Currently only one type of candidate generator is supported, the direct_generator. The phrase suggest API accepts a list of generators under the key direct_generator; each of the generators in the list is called per term in the original text.

Direct Generators

The direct generators support the following parameters:

  • field - The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.
  • analyzer - The analyzer used to analyze the suggest text. Defaults to the search analyzer of the suggest field.
  • size - The maximum number of corrections to be returned per suggest text token.
  • suggest_mode - The suggest mode controls which suggestions are included, or for which suggest text terms suggestions should be generated. Three possible values can be specified:
    • missing - Only suggest terms in the suggest text that aren't in the index. This is the default.
    • popular - Only suggest suggestions that occur in more docs than the original suggest text term.
    • always - Suggest any matching suggestions based on terms in the suggest text.
  • max_edits - The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.
  • min_prefix - The minimum number of prefix characters that must match in order for a term to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don't occur at the beginning of terms.
  • min_query_length - The minimum length a suggest text term must have in order to be included. Defaults to 4.
  • max_inspections - A factor that is multiplied with the shard_size in order to inspect more candidate spell corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
  • threshold_frequency - The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f, i.e. the check is not enabled. If a value higher than 1 is specified then the number cannot be fractional. The shard level document frequencies are used for this option.
  • max_query_frequency - The maximum threshold in number of documents in which a suggest text token can exist in order to be included. Can be a relative percentage number (e.g. 0.4) or an absolute number to represent document frequencies. If a value higher than 1 is specified then it cannot be fractional. Defaults to 0.01f. This can be used to exclude high frequency terms from being spellchecked; high frequency terms are usually spelled correctly, and on top of this it also improves spellcheck performance. The shard level document frequencies are used for this option.
  • pre_filter - a filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated. (optional)
  • post_filter - a filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer. (optional)

The following example shows a phrase suggest call with two generators: the first one uses a field containing ordinary indexed terms, and the second one uses a field whose
terms are indexed with a reverse filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators, which require a constant prefix to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.

curl -s -XPOST 'localhost:9200/_search' -d '{
 "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 4,
        "real_word_error_likelihood" : 0.95,
        "confidence" : 2.0,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        }, {
          "field" : "reverse",
          "suggest_mode" : "always",
          "min_word_len" : 1,
          "pre_filter" : "reverse",
          "post_filter" : "reverse"
        } ]
      }
    }
  }
}'

pre_filter and post_filter can also be used to inject synonyms after candidates are generated. For instance, for the query captain usq we might generate the candidate usa for the term usq; since usa is a synonym for america, this allows us to present captain america to the user if the phrase scores high enough.
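
As a hedged sketch, index settings like the following could define such a synonym analyzer to reference via post_filter; the index name, filter name (my_synonyms), analyzer name (candidate_synonyms), and the synonym entry are all illustrative:

curl -s -XPUT 'localhost:9200/test' -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "my_synonyms" : {
          "type" : "synonym",
          "synonyms" : [ "usa => usa, america" ]
        }
      },
      "analyzer" : {
        "candidate_synonyms" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "my_synonyms" ]
        }
      }
    }
  }
}'

A direct generator could then specify "post_filter" : "candidate_synonyms" so that a generated candidate usa is expanded to america before phrase scoring.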

@s1monw s1monw closed this as completed in d4ec03e Feb 28, 2013
@Downchuck

naive is a more commonly used term (vs "stupid") for the StupidBackoff param.

@s1monw
Contributor Author

s1monw commented Mar 1, 2013

@Downchuck I haven't heard of "Naive Backoff" - do you have any pointers where folks refer to this language model as "Naive Backoff" vs. "Stupid Backoff"?

@Downchuck

Oops, turns out I'm completely wrong (thinking of something else); I should pay more attention when I'm at work.

@s1monw
Contributor Author

s1monw commented Mar 1, 2013

@Downchuck no worries - it's friday though :)

@jtreher

jtreher commented Mar 1, 2013

Well, this looks pretty sexy.

@mattweber
Contributor

So what is the best way to use this? Set up a multi-field, say "suggestion", that does regular analysis, then "suggestion.bigram" that adds an n-gram token filter? Then set the phrase field to "suggestion.bigram" and the direct generator field to "suggestion"?

@s1monw
Contributor Author

s1monw commented Mar 2, 2013

@mattweber yeah, that is pretty much what I intended! Just make sure you use the shingle filter, not ngram. NGram is a character n-gram; this guy needs word n-grams.
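
For anyone following along, here is a minimal sketch of that setup, assuming the 0.90-era multi_field mapping syntax; the index, type, field, and analyzer names (suggestion, bigram) are illustrative:

curl -s -XPUT 'localhost:9200/test' -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "my_shingle" : {
          "type" : "shingle",
          "min_shingle_size" : 2,
          "max_shingle_size" : 2,
          "output_unigrams" : true
        }
      },
      "analyzer" : {
        "bigram" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "my_shingle" ]
        }
      }
    }
  },
  "mappings" : {
    "doc" : {
      "properties" : {
        "suggestion" : {
          "type" : "multi_field",
          "fields" : {
            "suggestion" : { "type" : "string", "analyzer" : "standard" },
            "bigram" : { "type" : "string", "analyzer" : "bigram" }
          }
        }
      }
    }
  }
}'

The phrase suggester would then use "field" : "suggestion.bigram" while the direct generator points at "suggestion", as described above.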

@mattweber
Contributor

Would it make sense to move the smoothing options under a "smoothing" object? Something like:

curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 1,
        "real_word_error_likelihood" : 0.95,
        "max_errors" : 0.5,
        "gram_size" : 2,
        "smoothing" : {
          "stupid_backoff" : {
            "discount" : 0.4
          }
        },
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        } ]
      }
    }
  }
}'

@jtreher

jtreher commented Mar 5, 2013

I'm trying to get corrections for missing spaces, i.e. Recipe for ItalienFood.
Did you mean: Recipe for italian food?

I'm guessing for this you would almost need to match grams stripped of whitespace?

Regarding phrase correction in general, I'm only getting corrected results for the last word in my phrase, so I must have my field set up improperly. I'm using an out-of-the-box shingle filter.

@s1monw
Contributor Author

s1monw commented Mar 5, 2013

@jtreher I am afraid we don't have the infrastructure in place to support this at this point but I am working on generators that support this kind of checking.

@s1monw
Contributor Author

s1monw commented Mar 5, 2013

@mattweber I mean, we can do that; what would be the benefit? Would it be more natural to use?

@mattweber
Contributor

Yes, I think it is more natural and fits the API better. Throughout the API there really aren't any other places where you just have a variably named object like that. Doesn't really matter though.

@jtreher

jtreher commented Mar 5, 2013

Well, I did get phrase correction working quite well save for the spaces. I'm still messing with the configs and am excited to see what you settle on for parameter names. Looking at the source works for now. :)

I was struggling initially because I did not double check my candidate field. It was using a stupid kstem filter! I was all over the place messing with the configs. I was getting good results once in a while for certain phrases. Oh well, it was a good learning experience as I think I've toyed with every setting. Flip the switch to a field using standard and bam!

@s1monw
Contributor Author

s1monw commented Mar 5, 2013

@jtreher did you try using stupid backoff?

@jtreher

jtreher commented Mar 5, 2013

@s1monw I changed discount all around, but didn't really see much difference.

@mattweber
Contributor

Switching from laplace to stupid_backoff with the default discount made a huge difference in my tests.

@jtreher

jtreher commented Mar 5, 2013

Hah, I don't think I actually switched smoothing models. No wonder discount didn't do anything. I've only been working with it for a few hours.

@mattweber
Contributor

If it helps, these are my current settings, which are returning pretty good results.

"analyzer": "standard"
"field" : "Title.Shingles"    // shingles max size of 2
"real_word_error_likelihood" : 0.95
"confidence" : 2.0
"max_errors": 0.75
"gram_size" : 2
"stupid_backoff": {}

@jtreher

jtreher commented Mar 5, 2013

Yeah, I just got it after scouring the source again. I figured that since "discount" wasn't throwing an error, it must be working. I wasn't putting it in the stupid backoff object. Thanks for the support guys.

@s1monw
Contributor Author

s1monw commented Mar 6, 2013

@mattweber @jtreher I will go ahead and open an issue to move the smoothing into its own object and make sure we fail if there is a parameter that is not known (I actually thought this works already....) - thanks for getting all this info back to me!

@s1monw
Contributor Author

s1monw commented Mar 6, 2013

see #2735

@jtreher

jtreher commented Mar 6, 2013

@s1monw Thanks a ton again. I see you changed the term API to match the phrase. Nicely done.

Now, if I can only get tawl to turn into towel without using a phonetic filter! Phonetic bigrams are providing some interesting results with this as well, but it's a lot of traffic. I'm also experimenting with character grams filtered with a regex that strips whitespace. I've been able to make sense out of a lot of mistyped phrases. It's quite satisfying.

@jtreher

jtreher commented Mar 8, 2013

@s1monw Is there something that would be constraining suggest size to 5, even with the size field override? At first I thought it was just an edit distance issue, but I've set my string_distance to be straight-up Levenshtein, and I know that the edit distance is equal for tawl => towel and tawl => tail (2). It seems that I only ever get 5 results. I can set size to less than 5 which works out well, but I can't seem to get more. I find the same with Term and Phrase in Beta2. I use max_edits:2.

@s1monw
Contributor Author

s1monw commented Mar 8, 2013

@jtreher there are 2 size params: one on the phrase suggester itself and one on the candidate generator. You should be able to override the one on the candidate generator to get what you want.

I actually thought about the phonetic stuff. In theory you can make this work with phonetics as well. If you have a field that creates tokens in a certain way, like "soundex|actualword" (in your example "T400|towel"), you can build a direct generator that uses a prefix_len of 5, produces tokens like "T400|tawl" with a pre-filter on the generator, and removes the 5 leading chars with a post-filter. Maybe give it a try; that way you only get LD matches that also have the same soundex code.
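
A hedged sketch of the generator half of that idea; the field and analyzer names are hypothetical, the prefix-length parameter name follows the comment and may differ from the actual option name, and the two filters would have to add and strip the five-character soundex prefix respectively:

"direct_generator" : [ {
  "field" : "body_soundex_word",          // hypothetical field indexed as "soundex|word" tokens, e.g. "T400|towel"
  "suggest_mode" : "always",
  "prefix_len" : 5,                       // only consider candidates sharing the "T400|" prefix
  "pre_filter" : "soundex_prefixer",      // hypothetical analyzer: rewrites "tawl" to "T400|tawl"
  "post_filter" : "strip_soundex_prefix"  // hypothetical analyzer: strips the 5 leading chars again
} ]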

@jtreher

jtreher commented Mar 8, 2013

@s1monw Interesting idea. I also found some other interesting uses of the phonetic filter over the last few days. I'm finding doublemetaphone to be the most useful. I think I might have found a bug with phonetic highlighting as well. Usually it works, but once in a while it will highlight the whole string. I'll test that more and log it if so.

Adding size:10 to both areas results in the same effect, I'm afraid.

@s1monw
Contributor Author

s1monw commented Mar 9, 2013

@jtreher you are right, see #2752

note that if you raise the # of candidates you should also lower the accuracy to get "far away" candidates

@jtreher

jtreher commented Mar 11, 2013

Edited. Thanks! Re #2752: I looked at the revision and it seems that most changes are hitting the phrase suggester. I found that the term suggester is also not respecting the size. Does this fix that?

@s1monw
Contributor Author

s1monw commented Mar 12, 2013

@jtreher size is a parameter on the candidate generator. If you set size on the term suggester you should also set shard_size if you have only one shard; shard_size controls the number of terms per shard. Hope that helps.
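
For example, on the term suggester both knobs could be raised together, something like the following (the field name and values are illustrative):

curl -s -XPOST 'localhost:9200/_suggest' -d '{
  "text" : "tawl",
  "my_term_suggest" : {
    "term" : {
      "field" : "description_plain",
      "size" : 10,
      "shard_size" : 50,
      "max_edits" : 2
    }
  }
}'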

@jtreher

jtreher commented Apr 10, 2013

What happened to the threshold_frequency for the direct generator? Some other arguments seem to be throwing illegal argument exceptions as well.

Also, of note, this behavior might confuse users new to information retrieval, like myself. I think it just needs more attention in the documentation, so I will put a small note on this page. Notice my query below to the _suggest REST API. If I leave the size as default for the candidate generator, it never finds "towel" for "paper tawl" even though it is the second most common phrase with the word paper. If I override size to 10 in the candidate generator, it finds it and puts it at position 2. So, obviously users need to understand that the direct generator doesn't care about the whole phrase; it is merely providing candidates for the phrase suggester to use in its shingle calculations. I'm not sure how the candidate generator sorts, but I can see that tawl scores very low by default on a text suggest, but is ordered much higher when it comes to frequency sorting.

Of course it is definitely a balancing act, but just wanted to throw this out there for any google ninjas.

{
  "text": "paper tawl",
  "did_you_mean": {
    "phrase": {
      "field": "description_spellcheck_biword_shingle",
      "size": 5,
      "direct_generator": [
        {
          "field": "description_plain",
          "max_edits": 2,
          "size": 10
        }
      ]
    }
  }
}
//with size 10, we get the right results
did_you_mean: [
    {
        text: paper tawl
        offset: 0
        length: 10
        options: [
            {
                text: paper table
                score: 0.005753702
            }
            {
                text: paper towel
                score: 0.00501787
            }
            {
                text: paper take
                score: 0.002958646
            }
            {
                text: paper tall
                score: 0.000378143
            }
            {
                text: paper teal
                score: 0.00027579395
            }
        ]
    }
]
//if we don't give it enough candidates, we don't get the right results
did_you_mean: [

{
    text: paper tawl
    offset: 0
    length: 10
    options: [
        {
            text: paper table
            score: 0.005753702
        }
        {
            text: paper take
            score: 0.002958646
        }
        {
            text: paper tall
            score: 0.000378143
        }
        {
            text: paper tank
            score: 0.00022716245
        }
        {
            text: paper tail
            score: 0.000048908416
        }
    ]
}

]

@s1monw
Contributor Author

s1monw commented Apr 10, 2013

hey, yeah, I am sorry it's not perfect, so you still need to know something about how the software works. Regarding threshold_frequency, it's been renamed to min_doc_freq; I will update the documentation. Did you encounter any other non-working params?
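
So a generator that used threshold_frequency before would now look something like this (the field name and value are illustrative):

"direct_generator" : [ {
  "field" : "description_plain",
  "suggest_mode" : "always",
  "min_doc_freq" : 2
} ]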

@jtreher

jtreher commented Apr 10, 2013

I think it works fine! Thanks so much for the awesome work. Again, I just wanted to write that down for someone who might stumble upon this page.

Here is the current "live" documentation and the arguments that aren't working.
http://www.elasticsearch.org/guide/reference/api/search/suggest/
CandidateGenerator doesn't support [max_query_frequency]
[CandidateGenerator doesn't support [min_query_length]]
[CandidateGenerator doesn't support [analyzer]]
[CandidateGenerator doesn't support [threshold_frequency]]

@s1monw
Contributor Author

s1monw commented Apr 10, 2013

ah man thanks,

see these commits:

https://github.com/elasticsearch/elasticsearch.github.com/commit/5e0eff1fa3fc544f0ffa2ad71e19316eb7a922d6
https://github.com/elasticsearch/elasticsearch.github.com/commit/817f96edd1ed7a403e230e78bb30a1a86fd913d7

I hope I have time to add some more notes to the documentation including the problems you saw! thanks for the feedback!

@jtreher

jtreher commented Apr 10, 2013

Also of note for documentation: setting the stupid backoff discount to 0 prevents unigrams from showing. Sweet!
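
That is, assuming the smoothing object introduced in #2735, something like:

"smoothing" : {
  "stupid_backoff" : {
    "discount" : 0.0
  }
}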

@s1monw
Contributor Author

s1monw commented Apr 10, 2013

@jtreher do you wanna contribute some documentation?

@jtreher

jtreher commented Apr 10, 2013

@s1monw Sure, do I just fork the website and make some commits?

@s1monw
Contributor Author

s1monw commented Apr 10, 2013

yeah you just go and fork the website on github, commit your changes and open a pull request!

thanks for the help!

@jtreher

jtreher commented Apr 16, 2013

@s1monw One more question before I can prepare anything.
I might have found an issue where we use suggest_mode:always to find phrase matches where the terms themselves are in the index, but the phrase is not (for AND operator searches). Perhaps this is expected behavior? Confidence is set to 0 for testing.

Example phrase W1 W2.

  1. W1 and W2 are both found in the index.
  2. W1 W2 is not found in any trigram/bigram from our field (max_shingle_size:3).
  3. W1 W3 is our target phrase we would like suggested; it is found in our bigrams and trigrams.
  4. W3 is suggested by the candidate generator as the 9th suggestion with a score of 0.6 out of a top score of 0.8. Verified the candidate generator with a term suggest.
  5. If we misspell W2, it works correctly and presents multiple options, with W1 W3 at the top.
  6. W1 W2 as input has one option W1 W2 with a score of 0.000384
  7. W1 W3 as input has one option, W1 W3 with a score of 0.00946
  8. If confidence is 1, then no options are given for either input.

_suggest API

{
  "text": "W1 W2",
  "did_you_mean": {
    "phrase": {
      "field": "title_trigram",
      "confidence": 0,
      "direct_generator": [
        {
          "field": "title",
          "max_edits": 2,
          "size": 20,
          "suggest_mode": "always"
        }
      ]
    }
  }
}

@letrungtrung

I have a problem with the term suggester. I have 3 docs [doc1: "Mình hỏi nghĩ tí", doc2: "Mình hỏi nghien nghĩ tí", doc3: "Mình nghiêng hỏi ta"]. When I suggest "tí" with suggest_mode set to missing, why is the result "ta"? Please help me explain that.
