
ml inference ingest processor support for local models #2508

Merged: 5 commits, Jun 11, 2024

Conversation

rbhavna (Collaborator) commented on Jun 5, 2024

Description

ml inference ingest processor support for local models

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

-        .put(MLInferenceIngestProcessor.TYPE, new MLInferenceIngestProcessor.Factory(parameters.scriptService, parameters.client));
+        .put(
+            MLInferenceIngestProcessor.TYPE,
+            new MLInferenceIngestProcessor.Factory(parameters.scriptService, parameters.client, xContentRegistry)
+        );
Collaborator: Why do we need the xContentRegistry passed from the plugin?

rbhavna (Author): We add it here so that it can be passed as a dependency to MLInferenceIngestProcessor when it is instantiated in the Factory.
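A minimal sketch of this wiring, assuming the usual pattern of a plugin handing shared services to a processor factory once, which then threads them into every processor it creates; all type names here are simplified stand-ins, not the actual ml-commons or OpenSearch API:

// Simplified stand-ins for the real ScriptService, Client, and NamedXContentRegistry.
interface ScriptService {}
interface Client {}
interface XContentRegistry {}

class SketchProcessor {
    final ScriptService scriptService;
    final Client client;
    final XContentRegistry xContentRegistry;

    SketchProcessor(ScriptService scriptService, Client client, XContentRegistry xContentRegistry) {
        this.scriptService = scriptService;
        this.client = client;
        // Kept on the instance so the processor can parse model input/output
        // later without a global lookup.
        this.xContentRegistry = xContentRegistry;
    }

    static class Factory {
        private final ScriptService scriptService;
        private final Client client;
        private final XContentRegistry xContentRegistry;

        Factory(ScriptService scriptService, Client client, XContentRegistry xContentRegistry) {
            this.scriptService = scriptService;
            this.client = client;
            this.xContentRegistry = xContentRegistry;
        }

        SketchProcessor create() {
            // Every processor instance receives the registry injected above.
            return new SketchProcessor(scriptService, client, xContentRegistry);
        }
    }
}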

        existingFields++;
    }
}
if (!override && existingFields == dotPaths.size()) {
Collaborator: If override is false and existingFields equals the dotPaths size, do we silently skip adding the output mapping?

rbhavna (Author): Yes. If the ingest document already contains the processor's output field (e.g. a text_embedding field), the currently running processor skips it. The user can explicitly set override to true to rewrite the output field.

Collaborator: How does the user know the field was skipped? Maybe we can add some logging.

rbhavna (Author): Sure, will do that.

int existingFields = 0;
for (String path : dotPaths) {
    if (ingestDocument.hasField(path)) {
        existingFields++;
Collaborator: When the document doesn't have the new field, will it add the newField to the output mapping?

rbhavna (Author): No, the output fields are already added to newOutputMapping on line 204. In this for loop we check whether a field at the specified path already exists in the document; if it does and the override flag is false, we remove that field from the output mapping. This saves time by not re-processing a field whose value may already have been inferred.
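A minimal sketch of the check discussed in this thread, folding in the logging suggested above; the helper name, the Predicate-based stand-in for ingestDocument.hasField, and the log message are illustrative assumptions, not the actual ml-commons code:

import java.util.Set;
import java.util.function.Predicate;
import java.util.logging.Logger;

class OverrideCheckSketch {
    private static final Logger logger = Logger.getLogger(OverrideCheckSketch.class.getName());

    // Returns true when every configured output path already exists in the
    // document and override is false, i.e. inference can be skipped.
    static boolean shouldSkipInference(Set<String> dotPaths, boolean override, Predicate<String> hasField) {
        int existingFields = 0;
        for (String path : dotPaths) {
            if (hasField.test(path)) {
                existingFields++;
            }
        }
        if (!override && existingFields == dotPaths.size()) {
            logger.info("All output fields " + dotPaths + " already exist and override is false; skipping inference");
            return true;
        }
        return false;
    }
}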

Signed-off-by: Bhavana Ramaram <[email protected]>
jngz-es previously approved these changes Jun 10, 2024
List embedding1 = JsonPath.parse(document).read("_source.books[0].title_embedding");
Assert.assertEquals(1536, embedding1.size());
List embedding2 = JsonPath.parse(document).read("_source.books[1].title_embedding");
Assert.assertEquals(1536, embedding2.size());
Collaborator: For the IT: this tests the foreach processor with nested documents. Can you also add a test that does not use the foreach processor? Does it still work fine?

rbhavna (Author): Yes, it works fine. Will add a few more tests for local models.
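A hypothetical sketch of what the non-foreach assertion could look like, mirroring the JsonPath/JUnit style of the existing IT above; the top-level _source.title_embedding field name is an assumption for illustration:

import java.util.List;
import com.jayway.jsonpath.JsonPath;
import org.junit.Assert;

class SingleFieldEmbeddingAssertionSketch {
    // Without the foreach processor, the output lands on a single top-level
    // field rather than inside each nested books[i] object.
    static void assertTitleEmbedding(String document) {
        List embedding = JsonPath.parse(document).read("_source.title_embedding");
        Assert.assertEquals(1536, embedding.size());
    }
}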

MLModelConfig modelConfig = TextEmbeddingModelConfig
    .builder()
    .modelType("bert")
    .frameworkType(TextEmbeddingModelConfig.FrameworkType.SENTENCE_TRANSFORMERS)
Collaborator: Does the local model support only sentence transformers? Have you tested other types of local models?

rbhavna (Author): It also works with sparse encoding and cross-encoder models. Will add more unit tests.
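For illustration, a hypothetical variant of the config shown above targeting a different framework type; the HUGGINGFACE_TRANSFORMERS constant and the embeddingDimension value are assumptions about the ml-commons builder rather than verified API, shown only to indicate the processor is not tied to one framework type:

// Assumed builder fields; check TextEmbeddingModelConfig for the actual API.
MLModelConfig modelConfig = TextEmbeddingModelConfig
    .builder()
    .modelType("bert")
    .embeddingDimension(768)
    .frameworkType(TextEmbeddingModelConfig.FrameworkType.HUGGINGFACE_TRANSFORMERS)
    .build();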

rbhavna (Author): Found it difficult to add ITs for the other model types, so I covered them in UTs. We currently don't have predict ITs for those models: the pre-trained models can't be used in ITs because of their size and resulting timeouts. I can add ITs later by first adding a few small model URLs to the test data; I left a TODO in the test class.

ylwu-amzn previously approved these changes Jun 11, 2024
Signed-off-by: Bhavana Ramaram <[email protected]>
@rbhavna dismissed stale reviews from ylwu-amzn and jngz-es via a4f711b on Jun 11, 2024
@rbhavna merged commit 7cd5291 into opensearch-project:main on Jun 11, 2024. 9 checks passed.
opensearch-trigger-bot pushed a commit that referenced this pull request on Jun 11, 2024
* ml inference ingest processor support for local models

Signed-off-by: Bhavana Ramaram <[email protected]>
(cherry picked from commit 7cd5291)
ylwu-amzn pushed a commit that referenced this pull request Jun 11, 2024
* ml inference ingest processor support for local models

Signed-off-by: Bhavana Ramaram <[email protected]>
(cherry picked from commit 7cd5291)

Co-authored-by: Bhavana Ramaram <[email protected]>
opensearch-trigger-bot (Contributor):
The backport to feature/multi_tenancy failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-feature/multi_tenancy feature/multi_tenancy
# Navigate to the new working tree
cd .worktrees/backport-feature/multi_tenancy
# Create a new branch
git switch --create backport/backport-2508-to-feature/multi_tenancy
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 7cd52915d04d8ac7ddb6e37a74a256603587ce69
# Push it to GitHub
git push --set-upstream origin backport/backport-2508-to-feature/multi_tenancy
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-feature/multi_tenancy

Then, create a pull request where the base branch is feature/multi_tenancy and the compare/head branch is backport/backport-2508-to-feature/multi_tenancy.

dhrubo-os pushed a commit to dhrubo-os/ml-commons that referenced this pull request on Oct 2, 2024 (opensearch-project#2508)

* ml inference ingest processor support for local models

Signed-off-by: Bhavana Ramaram <[email protected]>
dhrubo-os added a commit that referenced this pull request Oct 2, 2024
* ml inference ingest processor support for local models

Signed-off-by: Bhavana Ramaram <[email protected]>
Co-authored-by: Bhavana Ramaram <[email protected]>