
Add score normalization and combination documentation #4985

Merged Sep 22, 2023 · 36 commits

Changes from 1 commit
6ab7d57
Add search phase results processor
kolchfa-aws Aug 28, 2023
79e9597
Add hybrid query
kolchfa-aws Aug 29, 2023
69d4274
Normalization processor additions
kolchfa-aws Sep 6, 2023
06dcb26
Add more details
kolchfa-aws Sep 6, 2023
f0d1667
Continue writing
kolchfa-aws Sep 7, 2023
0ff381f
Add more query then fetch details and diagram
kolchfa-aws Sep 7, 2023
32f7a6e
Small rewording
kolchfa-aws Sep 7, 2023
8b0bb3d
Leaner left nav headers
kolchfa-aws Sep 7, 2023
76e5164
Tech review feedback
kolchfa-aws Sep 7, 2023
2fe3464
Add semantic search tutorial
kolchfa-aws Sep 10, 2023
c353572
Reworded prerequisites
kolchfa-aws Sep 11, 2023
9cff096
Removed comma
kolchfa-aws Sep 11, 2023
7ee90cd
Rewording advanced prerequisites
kolchfa-aws Sep 11, 2023
7f360ba
Changed searching for ML model to shorter request
kolchfa-aws Sep 11, 2023
a898585
Update task type in register model response
kolchfa-aws Sep 11, 2023
6e1a73c
Changing example
kolchfa-aws Sep 12, 2023
b842fcf
Added huggingface prefix to model names
kolchfa-aws Sep 12, 2023
d7971cb
Change example responses
kolchfa-aws Sep 12, 2023
6ca775f
Added note about huggingface prefix
kolchfa-aws Sep 12, 2023
b16de8d
Update _ml-commons-plugin/semantic-search.md
kolchfa-aws Sep 12, 2023
f7bc213
Implemented doc review comments
kolchfa-aws Sep 12, 2023
c605b5a
List weights under parameters
kolchfa-aws Sep 12, 2023
1f89522
Remove one-shard warning for normalization processor
kolchfa-aws Sep 12, 2023
1bbb929
Apply suggestions from code review
kolchfa-aws Sep 13, 2023
e42f8ad
Implemented editorial comments
kolchfa-aws Sep 13, 2023
76a893b
Editorial comments and resolve merge conflicts
kolchfa-aws Sep 13, 2023
e126508
Change links
kolchfa-aws Sep 13, 2023
0c7b587
More editorial feedback
kolchfa-aws Sep 13, 2023
6d48caf
Change model-serving framework to ML framework
kolchfa-aws Sep 13, 2023
838b42f
Use get model API to check model status
kolchfa-aws Sep 13, 2023
9ead908
Implemented tech review comments
kolchfa-aws Sep 13, 2023
8f292f1
Added neural search description and diagram
kolchfa-aws Sep 14, 2023
6fd7468
More editorial comments
kolchfa-aws Sep 15, 2023
20cb3df
Add link to profile API
kolchfa-aws Sep 15, 2023
0c3f589
Addressed more tech review comments
kolchfa-aws Sep 18, 2023
76036c4
Implemented editorial comments on changes
kolchfa-aws Sep 18, 2023
43 changes: 39 additions & 4 deletions _ml-commons-plugin/semantic-search.md
@@ -38,7 +38,7 @@

## Prerequisites

For this simple example, you'll use an OpenSearch-provided machine learning (ML) model and a cluster with no dedicated ML nodes. To ensure that this basic local setup works, send the following request to update ML-related cluster settings:
For this simple setup, you'll use an OpenSearch-provided machine learning (ML) model and a cluster with no dedicated ML nodes. To ensure that this basic local setup works, send the following request to update ML-related cluster settings:

```json
PUT _cluster/settings
@@ -61,7 +61,7 @@
For a more advanced setup, note the following requirements:

- To register a custom model, you need to specify an additional `"allow_registering_model_via_url": "true"` cluster setting.
- On clusters with dedicated ML nodes, you may want to specify `"only_run_on_ml_node": "true"` for improved performance.
- In production, it's best practice to separate the workloads by having dedicated ML nodes. On clusters with dedicated ML nodes, specify `"only_run_on_ml_node": "true"` for improved performance.
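
As an illustration of the two advanced settings described above, a combined request might look like the following sketch (the flattened key form is an assumption; nested `plugins.ml_commons` objects work equally well):

```json
PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_url": "true",
    "plugins.ml_commons.only_run_on_ml_node": "true"
  }
}
```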

For more information about ML-related cluster settings, see [ML Commons cluster settings]({{site.url}}{{site.baseurl}}/ml-commons-plugin/cluster-settings/).

@@ -297,7 +297,7 @@
}
```

The response contains the model information, including the `model_state` (`REGISTERED`) and the number of chunks into which it was split `total_chunks` (27).
The response contains the model information. You can see that the `model_state` is `REGISTERED`. Additionally, the model was split into 27 chunks, as shown in the `total_chunks` field.
</details>

#### Advanced: Registering a custom model
@@ -427,7 +427,7 @@

### Step 2(a): Create an ingest pipeline for neural search

The first step in setting up [neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/) is to create an [ingest pipeline]({{site.url}}{{site.baseurl}}/api-reference/ingest-apis/index/). The ingest pipeline will contain one processor: a task that transforms document fields. For neural search, you'll need to set up a `text_embedding` processor that takes in text and creates vector embeddings from that text. You'll need the `model_id` of the model you set up in the previous section and a `field_map`, which specifies the name of the field from which to take the text (`text`) and the name of the field in which to record embeddings (`passage_embedding`):
Now that you have deployed a model, you can use this model to configure [neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/). First, you need to create an [ingest pipeline]({{site.url}}{{site.baseurl}}/api-reference/ingest-apis/index/) that contains one processor: a task that transforms document fields before documents are ingested into an index. For neural search, you'll set up a `text_embedding` processor that creates vector embeddings from text. You'll need the `model_id` of the model you set up in the previous section and a `field_map`, which specifies the name of the field from which to take the text (`text`) and the name of the field in which to record embeddings (`passage_embedding`):

```json
PUT /_ingest/pipeline/nlp-ingest-pipeline
@@ -588,6 +588,37 @@
```
{% include copy-curl.html %}
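
The body of this pipeline request is collapsed in the diff. Based on the description above, a minimal sketch (the `description` text is an assumption; substitute the model ID you obtained earlier) is:

```json
PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "An NLP ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model ID from the previous section>",
        "field_map": {
          "text": "passage_embedding"
        }
      }
    }
  ]
}
```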

When the documents are ingested into the index, the `text_embedding` processor creates an additional field that contains vector embeddings and adds that field to the document. To see an example of an indexed document, retrieve document 1:

```json
GET /my-nlp-index/_doc/1
```
{% include copy-curl.html %}

The response shows the document `_source` containing the original `text` and `id` fields and the added `passage_embedding` field:

```json
{
"_index": "my-nlp-index",
"_id": "1",
"_version": 1,
"_seq_no": 0,
"_primary_term": 1,
"found": true,
"_source": {
"passage_embedding": [
0.04491629,
-0.34105563,
0.036822468,
-0.14139028,
...
],
"text": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
}
```

## Step 3: Search the data

Now you'll search the index using keyword search, neural search, and a combination of the two.
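
As a sketch of what the neural search request might look like (the query text `"wild west"` and the `k` value are illustrative assumptions; substitute your own model ID):

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "wild west",
        "model_id": "<deployed model ID>",
        "k": 5
      }
    }
  }
}
```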
@@ -785,7 +816,7 @@

### Search using a combined keyword search and neural search

To combine keyword search and neural search, you need to set up a [search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/index/) that runs at search time. The search pipeline you'll configure intercepts search results at an intermediate stage and applies the [`normalization_processor`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/normalization-processor/) to them. The `normalization_processor` normalizes and combines the document scores from multiple query clauses, rescoring the documents according to the chosen normalization and combination techniques.

#### Step 1: Configure a search pipeline

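The pipeline definition itself is collapsed in this diff. A minimal sketch of a `normalization-processor` configuration (the technique names and weights shown are illustrative assumptions) might be:

```json
PUT /_search/pipeline/nlp-search-pipeline
{
  "description": "Post-processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [0.3, 0.7]
          }
        }
      }
    }
  ]
}
```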
@@ -943,6 +974,10 @@

You can now experiment with different weights, normalization techniques, and combination techniques. For more information, see the [`normalization_processor`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/normalization-processor/) and [`hybrid` query]({{site.url}}{{site.baseurl}}/query-dsl/compound/hybrid/) documentation.
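
For instance, a hybrid request that applies the search pipeline and combines a keyword clause with a neural clause might look like this sketch (the query text and `k` value are assumptions):

```json
GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text": { "query": "wild west" }
          }
        },
        {
          "neural": {
            "passage_embedding": {
              "query_text": "wild west",
              "model_id": "<deployed model ID>",
              "k": 5
            }
          }
        }
      ]
    }
  }
}
```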

#### Advanced

You can parameterize the search by using search templates. Search templates hide implementation details and reduce the number of nested levels, thus reducing query complexity. For more information, see [search templates]({{site.url}}{{site.baseurl}}/search-plugins/search-template/).
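
As an illustration, a search template that parameterizes the neural query (the template structure and parameter names here are assumptions) could look like:

```json
GET /my-nlp-index/_search/template
{
  "source": {
    "query": {
      "neural": {
        "passage_embedding": {
          "query_text": "{{query_text}}",
          "model_id": "{{model_id}}",
          "k": 5
        }
      }
    }
  },
  "params": {
    "query_text": "wild west",
    "model_id": "<deployed model ID>"
  }
}
```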
Collaborator: "parameterize". "...by using search templates, hiding implementation details, or reducing the number of nested levels, thus reducing the query complexity"?

Author: Reworded.


### Clean up

After you're done, delete the components you've created in this tutorial from the cluster: