[ML] adds new n_gram_encoding custom processor #61578
Pinging @elastic/ml-core (:ml)
run elasticsearch-ci/packaging-sample-windows
@elasticmachine update branch
LGTM
 */
public class NGram implements PreProcessor {

    public static final long SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(NGram.class);
I can't see this field being used in the client.
Yep, should be deleted.
this.field = ExceptionsHelper.requireNonNull(field, FIELD);
this.featurePrefix = ExceptionsHelper.requireNonNull(featurePrefix, FEATURE_PREFIX);
this.nGrams = ExceptionsHelper.requireNonNull(nGrams, NGRAMS);
if (Arrays.stream(this.nGrams).anyMatch(i -> i < 1)) {
Suggested change:
- if (Arrays.stream(this.nGrams).anyMatch(i -> i < 1)) {
+ if (Arrays.stream(this.nGrams).anyMatch(i -> (i < MIN_GRAM) || (i > MAX_GRAM))) {
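For anyone following the thread, the two-sided range check can be sketched in isolation as below. This is only an illustration, not the PR's code: the class `NGramBounds`, the helper names, and the error message are hypothetical, and the values of `MIN_GRAM` and `MAX_GRAM` are assumed from the `[1..5]` range mentioned in the PR description.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch of the suggested validation.
final class NGramBounds {
    // Assumed bounds, inferred from the "[1..5]" range in the PR description.
    static final int MIN_GRAM = 1;
    static final int MAX_GRAM = 5;

    /** Returns true when every requested n-gram length lies in [MIN_GRAM, MAX_GRAM]. */
    static boolean allInRange(int[] nGrams) {
        return Arrays.stream(nGrams).noneMatch(i -> (i < MIN_GRAM) || (i > MAX_GRAM));
    }

    /** Returns the array unchanged, or throws if any length is out of range. */
    static int[] requireValid(int[] nGrams) {
        if (!allInRange(nGrams)) {
            throw new IllegalArgumentException(
                "n-gram lengths must be in [" + MIN_GRAM + ", " + MAX_GRAM + "]: "
                    + Arrays.toString(nGrams));
        }
        return nGrams;
    }
}
```

With these assumed bounds, `requireValid(new int[]{1, 2})` returns its argument, while `requireValid(new int[]{0})` and `requireValid(new int[]{6})` both throw.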
final int len = Math.min(startPos + length, stringValue.length());
for (int i = 0; i < len; i++) {
    for (int nGram : nGrams) {
        if (startPos + i + nGram - 1 >= len) {
Suggested change:
- if (startPos + i + nGram - 1 >= len) {
+ if (startPos + i + nGram > len) {
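The two forms of the bound check are equivalent for integers, since `a - 1 >= b` holds exactly when `a > b`. A minimal sketch of the windowing loop using the simplified check follows. It is an assumption-laden illustration rather than the PR's implementation: the class `NGramWindowSketch` and method `extract` are hypothetical, it assumes a non-negative `startPos`, it skips over-long n-grams with `continue` (the source does not show what the loop body does), and it returns raw substrings instead of named features.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of n-gram extraction over a [startPos, startPos + length) window.
final class NGramWindowSketch {
    static List<String> extract(String value, int[] nGrams, int startPos, int length) {
        List<String> out = new ArrayList<>();
        // Clamp the end of the window to the end of the string.
        final int len = Math.min(startPos + length, value.length());
        // Tighter loop bound than "i < len"; the result is the same because
        // out-of-window positions are rejected by the check below anyway.
        for (int i = 0; startPos + i < len; i++) {
            for (int nGram : nGrams) {
                // Simplified form of: startPos + i + nGram - 1 >= len
                if (startPos + i + nGram > len) {
                    continue; // this n-gram would run past the window
                }
                out.add(value.substring(startPos + i, startPos + i + nGram));
            }
        }
        return out;
    }
}
```

For the string `cat` with n-gram lengths `{1, 2}`, start `0` and length `2`, this sketch yields `c`, `ca`, `a`; the 2-gram starting at `a` is dropped because it would extend past the window.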
null,
null,
Collections.singletonList(new NGram(TEXT_FIELD, "f", new int[]{1, 2}, 0, 2, true))))
.setAnalyzedFields(new FetchSourceContext(true, new String[]{TEXT_FIELD, NUMERICAL_FIELD}, new String[]{}))
I found this confusing at first because I thought the analyzed fields should include the ngram `f.x` fields and exclude the `TEXT_FIELD`. `setAnalyzedFields` is now poorly named; it is more like `setFetchedFields`.
Is there a way of specifying which ngram fields should be modelled or, more generally, which output fields of any pre-processor are used?
`analyzed_fields` = all fields grabbed from docs. These fields are chosen for FULL analysis (including being processed).
There is no way of specifying feature inclusion for processed features. They are always included. This is for API simplicity.
Maybe renaming `analyzed_fields` to `fetched_fields` is proper.
@dimitris-athanasiou ^ what do you think?
@elasticmachine update branch
This adds a new `n_gram_encoding` feature processor for analytics and inference.
The focus of this processor is simple ngram encodings that allow:
- multiple ngrams [1..5]
- Prefix, infix, suffix
Format
Example usage:
The feature names returned from the encoding have the following format:
Example: for the string `cat` with `feature_prefix: "f"`