[RFC] Schema on Reads #1133

Closed
imRishN opened this issue Aug 22, 2021 · 35 comments
Labels
enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Roadmap:Search Project-wide roadmap label Search:Query Capabilities Search Search query, autocomplete ...etc v2.15.0 Issues and PRs related to version 2.15.0

Comments

@imRishN
Member

imRishN commented Aug 22, 2021

Problem Statement

By default, OpenSearch supports ‘schema on write’, i.e., the structure is defined at ingest time so that it is available for querying immediately. However, as use cases for OpenSearch have evolved, there is a need for greater flexibility. End users may not know the data structure up front, or may want additional attributes to query on after ingest. This is where ‘schema on read’ is useful. With ‘schema on read’, a field can be defined at query time. This can also improve the ingest rate considerably, because fields that are not going to be queried right away do not have to be indexed.

Requirements

  1. Ability to define fields that are evaluated at query time.
  2. No changes should be made to the underlying schema. This avoids the need to re-index existing data.
  3. These user defined fields should support all operations of a regular field in the query.

Existing Solution

Scripting

Scripting is supported in various constructs of the _search request body. In each of these constructs, the fundamental mechanism is the same: the script is evaluated at query time, derives one or more values from the indexed fields, and acts on the derived values.

  • In query and filter context, the derived value can be used to filter out documents.
  • In aggregations, results can be aggregated on the derived value.
  • The derived values can be exposed as a custom field by including it in script_fields.
  • Results can also be sorted on the derived value.
  • Using script_score, the derived value can be used to score the filtered documents.

Shortcomings of existing solution

Scripting satisfies most of the requirements listed above, but adding scripts to the request makes it bulky, hard to read, and difficult to manage. Even though scripts can be stored and referenced in the query, that does not help readability.

The following example highlights this:

GET index_1/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": """
 return ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value) > 18;
 """
        }
      }
    }
  },
  "aggs": {
    "day-aggregations": {
      "histogram": {
        "interval": 10,
        "script": {
          "source": "ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value);"
        }
      }
    }
  },
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "source": "ChronoUnit.DAYS.between(doc['dob'].value, doc['create_time'].value);"
      },
      "order": "desc"
    }
  },
  "_source": true,
  "script_fields": {
    "age": {
      "script": "ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value);"
    }
  },
  "size": 10
}

Proposed Solution

Regular OpenSearch queries revolve around fields in the schema. With scripting, the query syntax changes a lot.
In the proposed solution, we aim to make schema on read easy to use while keeping all the benefits of scripting.
The proposal is to define fields in the mapping that are evaluated at query time and behave like regular fields.
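
To make the contrast concrete, here is a rough sketch in Python of the two request shapes. The first is a simplified version of the script-based request above (it uses the same expression everywhere, whereas the original sort uses DAYS). The second assumes a field defined once in the mapping. The `query_time_fields` key and the `age` field name are hypothetical and only illustrate the idea; the RFC does not fix a syntax at this point.

```python
# Sketch only: "query_time_fields" and the field name "age" are hypothetical.
SCRIPT = "ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value)"

# Today: the same expression is repeated in every construct that needs it.
script_request = {
    "query": {"bool": {"filter": {"script": {"script": {"source": SCRIPT + " > 18"}}}}},
    "aggs": {"by_age": {"histogram": {"interval": 10, "script": {"source": SCRIPT}}}},
    "sort": {"_script": {"type": "number", "script": {"source": SCRIPT}, "order": "desc"}},
    "script_fields": {"age": {"script": {"source": SCRIPT}}},
}

# Proposed: define the field once in the mapping (hypothetical shape) ...
mapping = {"query_time_fields": {"age": {"type": "long", "script": {"source": SCRIPT}}}}

# ... and refer to it like any other field in the request.
field_request = {
    "query": {"bool": {"filter": {"range": {"age": {"gt": 18}}}}},
    "aggs": {"by_age": {"histogram": {"field": "age", "interval": 10}}},
    "sort": [{"age": "desc"}],
    "fields": ["age"],
}


def count_sources(node):
    """Count inline script sources anywhere in a nested request body."""
    if isinstance(node, dict):
        return sum((1 if key == "source" else 0) + count_sources(value)
                   for key, value in node.items())
    if isinstance(node, list):
        return sum(count_sources(item) for item in node)
    return 0
```

With this helper, `count_sources(script_request)` counts four inline scripts, while `count_sources(field_request)` finds none; the single script lives in the mapping.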

@imRishN imRishN added the enhancement Enhancement or improvement to existing feature or request label Aug 22, 2021
@imRishN imRishN changed the title Schema on Reads [RFC] Schema on Reads Oct 1, 2021
@dblock
Member

dblock commented Nov 23, 2021

@rramachand21 are you working on this?

@anasalkouz
Member

Thanks for putting up this proposal. One downside of this feature is that clients will start taking the easy path and use schema on read even for fields that are queried frequently. We should think about tracking field usage and adding guardrails to avoid abuse of the feature.
Could you explain more about how those fields can be used? For example, can I use them in aggregations? Are they searchable? Or will they only be used for data retrieval?

@lrynek

lrynek commented Jan 25, 2022

@imRishN what about the existing runtime fields feature? It looks almost the same as what you are proposing here:
👉 https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html
Could you explain how your solution will differ from runtime fields? (just asking out of curiosity, to better understand the proposal at hand 😉 )

@reta
Collaborator

reta commented Jan 25, 2022

@lrynek You are quite right, runtime fields serve the same purpose (schema on read), but this is a proprietary Elasticsearch feature / implementation. The goal of this RFC is to provide similar functionality on the OpenSearch side (but obviously it cannot be copied as is).

@lrynek

lrynek commented Jan 26, 2022

@imRishN Oh, I didn't know that, thanks for the explanation! 👍 // I had assumed that, since OpenSearch is a fork of Elasticsearch 7, it would have all the features available in that version too. Is there any reference listing such discrepancies between the two projects? It would be awesome...😎

@dblock
Member

dblock commented Feb 2, 2022

> @imRishN Oh, I didn't know that, thanks for the explanation! 👍 // I had assumed that, since OpenSearch is a fork of Elasticsearch 7, it would have all the features available in that version too. Is there any reference listing such discrepancies between the two projects? It would be awesome...😎

OpenSearch forked at 7.10.2, so anything added in OpenSearch or ES since then is likely different.

@lrynek

lrynek commented Feb 2, 2022

@dblock Thanks for explaining! 👍

@marekm-gain

+1 for having this

@ryn9

ryn9 commented Oct 13, 2022

@imRishN @rramachand21 @elfisher
Is this advancing? It looks like this keeps slipping?

@elfisher

@imRishN is this being worked on?

@grahamplace

Came from: https://forum.opensearch.org/t/runtime-fields-on-opensearch/9837

I'm bummed that opensearch doesn't support runtime fields — they seemed like the solution I needed for my project (was reading about ES, obviously), so I'm disappointed that I'm left without the feature having chosen OS over ES 😞

@rramachand21
Member

We will be looking into this and will update this issue with a more accurate version in which it will be available. As usual, open source contributions are welcome :) If there is interest in contributing to this, please do reach out.

@svdasein

svdasein commented Apr 5, 2023

@rramachand21 has this made it to the roadmap yet? Can you comment on status?

@khmelevskii

@rramachand21 do you have a plan to deliver it?

@hrbu

hrbu commented Jun 7, 2023

Voting for this feature too. This would massively simplify our task of building an integrated view on distributed data. Currently we manage this with a preprocessing service that resolves references before indexing.

@AhmedAbdoOrtiga

AhmedAbdoOrtiga commented Jun 14, 2023

It's crazy that such a feature has still been in the backlog since 2021!

@dblock
Member

dblock commented Jun 16, 2023

Please contribute!

@khmelevskii

Please contribute or let us know your plan about this feature.

@rishabhmaurya
Contributor

Is anyone working on it? If not, I can pick this up.

@rishabhmaurya
Contributor

rishabhmaurya commented Oct 12, 2023

I'm thinking along the lines of using search pipelines, and whether we can use or enhance a search processor to achieve the same. Feedback and ideas are welcome, thanks.
cc @msfroh @noCharger

@msfroh
Collaborator

msfroh commented Oct 13, 2023

> I'm thinking along the lines of using search pipelines, and whether we can use or enhance a search processor to achieve the same. Feedback and ideas are welcome, thanks.
> cc @msfroh @noCharger

Reading up on the runtime fields feature, it sounds a lot like adding script fields to the mapping, so that you don't need to specify them at query time. As script fields, they're still computed at runtime. From the original issue description, it sounds like the problem with script fields is that they're bulky and awkward to inject into a query.

You might be able to simplify the syntax with a search request processor, essentially injecting script fields into the query wherever a "runtime field" is specified.

(At first I was thinking about this purely from the perspective of adding a field to results, which could also be done with a search response processor, but then it would be too late to use the field in filters and aggregations.)
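
As a rough illustration of the request-processor idea, the Python sketch below rewrites references to a named field in `fields` and `sort` into `script_fields` and `_script` sort clauses before the request is executed. It is not the search pipeline API: `DERIVED`, `rewrite_request`, and the assumption that every such field is numeric are made up for this example, and filters and aggregations are left out.

```python
import copy

# Hypothetical definitions; in a real processor these might come from the
# processor configuration or the index mapping.
DERIVED = {
    "age": "ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value)",
}


def rewrite_request(body, derived=DERIVED):
    """Return a copy of `body` where derived fields named in `fields` and
    `sort` are replaced by script_fields and _script sort clauses."""
    out = copy.deepcopy(body)

    # "fields": ["age"]  ->  "script_fields": {"age": {"script": {...}}}
    kept = []
    for name in out.pop("fields", []):
        if name in derived:
            out.setdefault("script_fields", {})[name] = {
                "script": {"source": derived[name]}
            }
        else:
            kept.append(name)
    if kept:
        out["fields"] = kept

    # "sort": [{"age": "desc"}]  ->  "sort": [{"_script": {...}}]
    new_sort = []
    for clause in out.get("sort", []):
        # Only the short forms "field" and {"field": "order"} are handled.
        name = next(iter(clause)) if isinstance(clause, dict) else clause
        if name in derived:
            order = clause[name] if isinstance(clause, dict) else "asc"
            new_sort.append({"_script": {
                "type": "number",  # assumption: the derived value is numeric
                "script": {"source": derived[name]},
                "order": order,
            }})
        else:
            new_sort.append(clause)
    if new_sort:
        out["sort"] = new_sort
    return out


request = {
    "query": {"match_all": {}},
    "fields": ["age", "name"],
    "sort": [{"age": "desc"}, "_score"],
}
rewritten = rewrite_request(request)
```

The caller writes `age` as if it were a regular field, and the script text appears only in the rewritten request.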

@rishabhmaurya
Contributor

Yes, if we add script fields by reading them from the index mappings or from runtime fields in the search request, that should solve the problem.
I initially liked the idea of using a request processor, but then thought it better to keep it simple and intuitive: these fields should behave like normal fields without the need to configure a search processor.

Also, going by the blog here - https://www.elastic.co/blog/getting-started-with-elasticsearch-runtime-fields, in Elasticsearch it's also possible to convert these fields to indexed fields at the time of rollover, which seems like a nice value add.

@msfroh
Collaborator

msfroh commented Oct 13, 2023

> Also, going by the blog here - https://www.elastic.co/blog/getting-started-with-elasticsearch-runtime-fields, in Elasticsearch it's also possible to convert these fields to indexed fields at the time of rollover, which seems like a nice value add.

Yeah -- I experimented a little bit with update_by_query (but not reindex) and found that you could "materialize" the output of a search request processor. This sounds pretty similar conceptually.

@rishabhmaurya
Contributor

We will get some boilerplate code to implement it once #6836, which I'm currently working on, is done.

@khmelevskii

Is there any timeframe to deliver it?

@rishabhmaurya
Contributor

rishabhmaurya commented Jan 21, 2024

Here is the breakdown for the implementation -

Runtime field mapping parsing

  • Can be defined both in index settings and in the query DSL. The query DSL definition should take precedence if both are specified.
  • Create a RuntimeFieldMapper, which extends the Mapper class and has the following additional attributes:
    • Type: intended field type
    • Script: associated script
   "runtime_mappings": {
     "<name>": {
       "type": "keyword",
       "script": {
         "source": "<script>"
       }
     }
   },
  • The MappedFieldType for it should behave as a regular field of the specified type, so that a dummy query can be built for it. The MappedFieldType should have docValues=false, isIndexed=false, isStored=false, and textSearchInfo set to SIMPLE_MATCH_ONLY_FIELD_TYPE.
  • When the QueryShardContext is created in SearchService.java, we can also parse the runtime mapping and store it in the QueryShardContext. When a runtime field is looked up, it will return the mapped field.
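
The precedence rule in the first bullet can be sketched as a simple merge. This is a minimal Python illustration, not the mapper code: `resolve_runtime_mappings` is a hypothetical helper, and the definitions follow the `runtime_mappings` shape shown above.

```python
def resolve_runtime_mappings(index_defs, request_defs):
    """Merge field definitions from the index and from the search request.

    A definition in the request replaces an index-level definition with
    the same name; all other index-level definitions are kept.
    """
    merged = dict(index_defs or {})
    merged.update(request_defs or {})
    return merged


# Hypothetical definitions, in the "runtime_mappings" shape from above.
index_level = {
    "age": {"type": "long", "script": {"source": "<index script>"}},
    "region": {"type": "keyword", "script": {"source": "<region script>"}},
}
request_level = {
    "age": {"type": "double", "script": {"source": "<request script>"}},
}

effective = resolve_runtime_mappings(index_level, request_level)
```

Here `effective["age"]` is the request-level definition, while `region` still comes from the index.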

QueryBuilder and execution

  • The QueryBuilder should have the following logic:
    • It is very similar to SourceFieldMatchQuery and ScriptQueryBuilder, with a few differences.
    • It accepts the parent query (i.e., the query formed from all other clauses in the DSL except the ones that include the runtime field), the dummy query created above, and the runtime field mapper object.
    • It creates a custom weight, just like SourceFieldMatchQuery, and executes the parent query. In the two-phase iterator's matches(), it fetches the value for the matching document by executing the Painless script associated with the runtime field mapper object against the document's _source. It then creates a Lucene MemoryIndex with a single field, whose type is taken from the mapper object and whose value is the one just computed by the script.
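
The execution flow described above can be approximated in a few lines of Python. This is only a sketch of the two-phase idea: a plain predicate stands in for running the dummy query against a Lucene MemoryIndex (presumably the purpose of that index), a Python function stands in for the Painless script, and all names and sample documents are hypothetical.

```python
def two_phase_search(docs, parent_query, derive, field_query):
    """Return (matching ids, number of script evaluations).

    Phase 1 runs the cheap parent query. Phase 2 derives the field value
    only for documents that survived phase 1 and checks it.
    """
    hits, evaluations = [], 0
    for doc in docs:
        if not parent_query(doc):
            continue  # rejected by the parent query; the script never runs
        evaluations += 1
        value = derive(doc["_source"])  # stand-in for the Painless script
        if field_query(value):          # stand-in for the dummy query
            hits.append(doc["_id"])
    return hits, evaluations


docs = [
    {"_id": "1", "_source": {"status": "active", "dob_year": 1990, "created_year": 2020}},
    {"_id": "2", "_source": {"status": "active", "dob_year": 2010, "created_year": 2020}},
    {"_id": "3", "_source": {"status": "closed", "dob_year": 1980, "created_year": 2020}},
]

hits, evaluations = two_phase_search(
    docs,
    parent_query=lambda d: d["_source"]["status"] == "active",
    derive=lambda src: src["created_year"] - src["dob_year"],
    field_query=lambda age: age > 18,
)
```

In this toy data set the script runs for two of the three documents, which is the point of evaluating the parent query first: the per-document script cost is paid only for candidates.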

Aggregation

Scoring

Sorting

@reta
Collaborator

reta commented Jan 22, 2024

@rishabhmaurya thanks for picking it up. I have nothing against runtime_fields, but it looks like Elasticsearch invented them in the first place; maybe we should look for another name?

@rishabhmaurya
Contributor

@reta that's a good point. Should we call them prototype fields, since they are meant for prototyping and not for permanent use?

@smacrakis

I don't think we should name them for what we think they are best used for. I'd hope that they would have essentially zero runtime cost and so not be suitable only for prototypes.

There are lots of good names to choose from. I think in DBMS's they are called generated, virtual, computed, calculated, derived, etc.

Or is there some critical difference between the DBMS concept and the OpenSearch concept which needs to be emphasized?

@reta
Collaborator

reta commented Jan 22, 2024

I like computed_fields or derived_fields, I think they fit very well the purpose, thank you @smacrakis

@rishabhmaurya
Contributor

+1 for derived_fields. I hope we don't already use that name for anything else in OpenSearch today.

@dagneyb dagneyb added the v2.13.0 Issues and PRs related to version 2.13.0 label Jan 22, 2024
@khmelevskii

It would be great to have the same name and interface as Elasticsearch for it

@reta
Collaborator

reta commented Jan 23, 2024

> It would be great to have the same name and interface as Elasticsearch for it

In theory, yes; in practice, we would be asking for problems: Elasticsearch is not OSS.

@rishabhmaurya
Contributor

Good news - we are done with most of the implementation (#12281), and here is a little documentation (opensearch-project/documentation-website#6943). I encourage folks waiting for it to give it a shot using a snapshot build and see if it meets their needs. Let us know if you have any feedback or suggestions; we are happy to incorporate them, possibly before the next version release.
Thanks @qreshi and @msfroh for your contributions.

@sohami sohami added RFC Issues requesting major changes Roadmap:Search Project-wide roadmap label labels May 14, 2024
@kkhatua kkhatua added the v2.15.0 Issues and PRs related to version 2.15.0 label May 20, 2024
@rishabhmaurya
Contributor

rishabhmaurya commented Jun 26, 2024

This feature is released in 2.15 - https://opensearch.org/docs/latest/field-types/supported-field-types/derived/
with certain limitations, which we will be working on next. Current and future limitations are tracked as part of #12281
