[ML] adds new n_gram_encoding custom processor #61578

benwtrent · 2020-08-26T12:22:06Z

This adds a new n_gram_encoding feature processor for analytics and inference.

The focus of this processor is simple ngram encodings that allow:

- multiple ngrams [1..5]
- Prefix, infix, suffix

Format

"n_gram_encoding": {
  "field": <input field name>,
  "n_grams": <array of ints indicating the n-gram lengths desired. Required. Min value 1, max value 5>,
  "feature_prefix": <optional feature name prefix. Defaults to n_gram_<start>_<length>>,
  "start": <optional start index. Defaults to 0. Can be negative to start from the end of the string (suffix)>,
  "length": <optional length of the string to encode into n-grams. Defaults to 50, max 100>
}
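
As a rough illustration of how start and length could select the part of the string that gets encoded, here is a minimal sketch. It is not the code from this PR: the class and method names are hypothetical, and it assumes that a negative start counts back from the end of the string and that a string too short for the requested start yields nothing.

```java
// Illustrative sketch only; names and edge-case behaviour are assumptions.
class NGramWindowSketch {

    // Returns the slice of value that n-grams would be taken from.
    static String window(String value, int start, int length) {
        // Assumption: a negative start is an offset from the end of the string.
        int startPos = start < 0 ? value.length() + start : start;
        if (startPos < 0 || startPos > value.length()) {
            // Assumption: the string is too short for this start, so nothing is encoded.
            return "";
        }
        // Never read past the end of the string.
        int end = Math.min(startPos + length, value.length());
        return value.substring(startPos, end);
    }

    public static void main(String[] args) {
        System.out.println(window("Amsterdam", 0, 3));  // prefix
        System.out.println(window("Amsterdam", 2, 4));  // infix
        System.out.println(window("Amsterdam", -3, 3)); // suffix
    }
}
```

Under these assumptions, start 0 gives prefix n-grams, a positive start gives infix n-grams, and a negative start gives suffix n-grams.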

Example usage:

PUT _ml/data_frame/analytics/foo
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "goof"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "DistanceKilometers",
      "num_top_feature_importance_values": 3,
      "feature_processors": [{
        "n_gram_encoding": {
          "field": "OriginCityName",
          "n_grams": [1, 2, 3],
          "feature_prefix": "f"
        }
      }]
    }
  },
  "analyzed_fields": {"includes": ["OriginCityName","DistanceKilometers"]},
  "model_memory_limit": "1gb"
}

The feature names returned from the encoding have the following format:

<feature_prefix>.<n_gram><pos>

Example:
for the string cat with feature_prefix: "f", the 2-gram at position 0 is

f.20: "ca"
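
To make the naming scheme concrete, the following sketch builds every feature name for a string. It is an illustration rather than the processor's implementation: the class and method names are hypothetical, and it assumes that pos is the zero-based offset of the n-gram's first character and that the name is a plain concatenation of prefix, n-gram size and position, which matches the f.20 example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the code merged in this PR.
class NGramNamingSketch {

    // Maps <feature_prefix>.<n_gram><pos> to the n-gram found at that position.
    static Map<String, String> encode(String value, String featurePrefix, int[] nGrams) {
        Map<String, String> features = new LinkedHashMap<>();
        for (int nGram : nGrams) {
            // Assumption: pos is the zero-based offset of the n-gram's first character.
            for (int pos = 0; pos + nGram <= value.length(); pos++) {
                features.put(featurePrefix + "." + nGram + pos, value.substring(pos, pos + nGram));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Prints one name=value line per feature, including f.20=ca.
        encode("cat", "f", new int[] {1, 2, 3})
            .forEach((name, gram) -> System.out.println(name + "=" + gram));
    }
}
```

For cat with n_grams [1, 2, 3] this sketch yields six features: f.10, f.11, f.12, f.20, f.21 and f.30.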

elasticmachine · 2020-08-26T12:22:08Z

Pinging @elastic/ml-core (:ml)

benwtrent · 2020-08-31T12:11:54Z

run elasticsearch-ci/packaging-sample-windows

benwtrent · 2020-08-31T13:43:50Z

@elasticmachine update branch

davidkyle

LGTM

davidkyle · 2020-09-03T08:55:43Z

...rest-high-level/src/main/java/org/elasticsearch/client/ml/inference/preprocessing/NGram.java

+ */
+public class NGram implements PreProcessor {
+
+ public static final long SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(NGram.class);


I can't see this field being used in the client.

Yep, should be deleted.

davidkyle · 2020-09-03T09:01:03Z

...plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/inference/preprocessing/NGram.java

+ this.field = ExceptionsHelper.requireNonNull(field, FIELD);
+ this.featurePrefix = ExceptionsHelper.requireNonNull(featurePrefix, FEATURE_PREFIX);
+ this.nGrams = ExceptionsHelper.requireNonNull(nGrams, NGRAMS);
+ if (Arrays.stream(this.nGrams).anyMatch(i -> i < 1)) {


Suggested change

- if (Arrays.stream(this.nGrams).anyMatch(i -> i < 1)) {
+ if (Arrays.stream(this.nGrams).anyMatch(i -> (i < MIN_GRAM) || (i > MAX_GRAM))) {
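
For context, a standalone version of the two-sided check might look like the sketch below. The MIN_GRAM and MAX_GRAM values are assumed from the "Max val 5, min val 1" note in the PR description, the class name is hypothetical, and a plain IllegalArgumentException stands in for whatever exception helper the real class uses.

```java
import java.util.Arrays;

// Illustrative sketch only; constants and exception type are assumptions.
class NGramValidationSketch {
    static final int MIN_GRAM = 1; // assumed from the PR description
    static final int MAX_GRAM = 5; // assumed from the PR description

    // Rejects a missing array and any n-gram size outside [MIN_GRAM, MAX_GRAM].
    static void validate(int[] nGrams) {
        if (nGrams == null) {
            throw new IllegalArgumentException("n_grams is required");
        }
        if (Arrays.stream(nGrams).anyMatch(i -> (i < MIN_GRAM) || (i > MAX_GRAM))) {
            throw new IllegalArgumentException(
                "n_grams must be between " + MIN_GRAM + " and " + MAX_GRAM + ", got " + Arrays.toString(nGrams));
        }
    }

    public static void main(String[] args) {
        validate(new int[] {1, 2, 3}); // accepted
        try {
            validate(new int[] {6});   // above the assumed maximum
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Checking only i < 1 would let values above the documented maximum through, which is the gap the suggestion closes.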

davidkyle · 2020-09-03T09:09:28Z

...plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/inference/preprocessing/NGram.java

+ final int len = Math.min(startPos + length, stringValue.length());
+ for (int i = 0; i < len; i++) {
+ for (int nGram : nGrams) {
+ if (startPos + i + nGram - 1 >= len) {


Suggested change

- if (startPos + i + nGram - 1 >= len) {
+ if (startPos + i + nGram > len) {
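
The two conditions are equivalent for int operands, since a - 1 >= b holds exactly when a > b, so the suggestion is a readability change rather than a behaviour change. A small sketch that confirms this over a range of small inputs (class and method names are hypothetical):

```java
// Illustrative check only; the method names are hypothetical.
class BoundaryCheckSketch {

    // Condition as originally written in the diff.
    static boolean original(int startPos, int i, int nGram, int len) {
        return startPos + i + nGram - 1 >= len;
    }

    // Condition from the suggested change.
    static boolean suggested(int startPos, int i, int nGram, int len) {
        return startPos + i + nGram > len;
    }

    // Compares both conditions for every combination of small inputs.
    static boolean agreeUpTo(int limit) {
        for (int startPos = 0; startPos <= limit; startPos++) {
            for (int i = 0; i <= limit; i++) {
                for (int nGram = 1; nGram <= 5; nGram++) {
                    for (int len = 0; len <= limit; len++) {
                        if (original(startPos, i, nGram, len) != suggested(startPos, i, nGram, len)) {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeUpTo(20)); // prints true
    }
}
```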

davidkyle · 2020-09-03T09:38:20Z

...s/src/test/java/org/elasticsearch/xpack/ml/integration/DataFrameAnalysisCustomFeatureIT.java

+ null,
+ null,
+ Collections.singletonList(new NGram(TEXT_FIELD, "f", new int[]{1, 2}, 0, 2, true))))
+ .setAnalyzedFields(new FetchSourceContext(true, new String[]{TEXT_FIELD, NUMERICAL_FIELD}, new String[]{}))


I found this confusing at first because I thought the analyzed fields should include the ngram f.x fields and exclude the TEXT_FIELD. setAnalyzedFields is now poorly named; it is more like setFetchedFields.

Is there a way of specifying which ngram fields should be modelled, or more generally which output fields of any pre-processor are used?

analyzed_fields = all fields grabbed from the docs. These fields are chosen for FULL analysis (including being processed).

There is no way of specifying feature inclusion for processed features. They are always included. This is for API simplicity.

Maybe renaming analyzed_fields to fetched_fields would be appropriate.

@dimitris-athanasiou ^ what do you think?

benwtrent · 2020-09-03T13:06:47Z

@elasticmachine update branch

[ML] adds new n_gram_encoding custom processor

3bb1ccd

benwtrent added >enhancement :ml Machine learning v8.0.0 v7.10.0 labels Aug 26, 2020

benwtrent marked this pull request as draft August 26, 2020 12:22

benwtrent added 5 commits August 27, 2020 10:12

Merge remote-tracking branch 'upstream/master' into feature/ml-analytics-ngram-processor

9fdf423

adding tests

7de22cd

removing debug

9481ffe

Merge remote-tracking branch 'upstream/master' into feature/ml-analytics-ngram-processor

69d925a

fixing test

6c4507b

benwtrent marked this pull request as ready for review August 27, 2020 18:07

Merge branch 'master' into feature/ml-analytics-ngram-processor

0618709

davidkyle approved these changes Sep 3, 2020

addressing pr comments

3150398

elasticmachine and others added 3 commits September 3, 2020 07:06

Merge branch 'master' into feature/ml-analytics-ngram-processor

72fe14a

moving integration test

0baae5c

Merge remote-tracking branch 'upstream/master' into feature/ml-analytics-ngram-processor

686f473

benwtrent merged commit 2341b20 into elastic:master Sep 3, 2020

benwtrent deleted the feature/ml-analytics-ngram-processor branch September 3, 2020 16:23

benwtrent mentioned this pull request Sep 3, 2020

[7.x] [ML] adds new n_gram_encoding custom processor (#61578) #61935

Merged

Mpdreamz mentioned this pull request Nov 16, 2020

7.10.1 Meta Ticket elastic/elasticsearch-net#5096

Closed

61 tasks

stevejgordon mentioned this pull request Dec 17, 2020

7.11.0 Meta Ticket elastic/elasticsearch-net#5198

Closed

jakelandis removed the v8.0.0 label Jul 26, 2021

jakelandis added the v8.0.0-alpha1 label Jul 26, 2021
