provide data types for semver and semver_range to enable indexing and querying semantic version values #48878

Closed
geekpete opened this issue Nov 6, 2019 · 31 comments
Assignees
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Deployment Management Meta label for Management Experience - Deployment Management team Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@geekpete
Member

geekpete commented Nov 6, 2019

Describe the feature:

Provide dedicated types to index and query against semantic versions with a semver and a semver_range type.

Use of Semantic Versioning is so widespread that a dedicated semver type and semver_range type will be useful primitives to add to Elasticsearch for varied use cases.

### semantic version type and semantic version range type example usage

PUT /semver-test
{
  "mappings": {
    "properties": {
      "message": { "type": "text" },
      "version": {
        "type": "semver"
      },
      "compatible_versions": {
        "type": "semver_range"
      }
    }
  }
}

# example app1
POST /semver-test/_doc/1
{
  "app": "coolapp-ce",
  "version": "1.3.2",
  "compatible_versions": {
    "gte" : "1.0.0",
    "lte" : "1.4.0"
  }
}

# example app2
POST /semver-test/_doc/2
{
  "app": "coolapp-TNG",
  "version": "6.1.9",
  "compatible_versions": {
    "gte" : "6.0.0-alpha1",
    "lt" : "6.2.0"
  }
}

# example app3
POST /semver-test/_doc/3
{
  "app": "coolapp-final",
  "version": "6.2.1",
  "compatible_versions": {
    "gte" : "6.2.0-alpha1",
    "lt" : "6.3.0"
  }
}

# semver range query
GET /semver-test/_search
{
    "query": {
        "range" : {
            "version" : {
                "gte" : "6",
                "lte" : "6.2.0"
            }
        }
    }
}

# query against semver range type
GET /semver-test/_search
{
  "query": {
    "term": {
      "compatible_versions": {
        "value": "6.2.0-beta3"
      }
    }
  }
}

More advanced range combinations might be interesting to implement, such as the version specifiers that Python's pip tool supports:
https://www.python.org/dev/peps/pep-0440/#version-specifiers

@imotov imotov added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Nov 8, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

@bczifra
Member

bczifra commented Nov 29, 2019

The discussion around integer/long mappings seems relevant to this issue.

@ejsmith

ejsmith commented Dec 3, 2019

Yes please!! This would be super helpful.

@webmat

webmat commented Dec 5, 2019

ECS captures versions in straightforward .version fields, in order to be able to capture whatever messy versions are out there in the wild. Semver versions are the "cleanest", but there are also date-based versions (like Ubuntu's), composite versions like RHEL's package release versions, and so on.

My ideal scenario would be to have a datatype we can add directly to existing .version fields as a multi-field, and would silently ignore what's not semver.

Here's a few things I'd like to see supported:

  • depth from one numeric (1) to at least 3 (the typical 1.2.3), and ideally 4 (like Chrome's 78.0.3904.108)
  • I'm split on whether we should specifically support labels (e.g. 1.2.3-beta1). Ignoring them and still indexing the numeric part would be great, as a start. If we can eventually support labels as well, even better.
  • I would love to see range queries on the semver datatype, e.g. tls.version >= 1.2. For me this is the highest value we can get from this.
    • I'd love to see support for gt, gte, lt, lte as the basics
    • Bonus points if we can get semantically nicer operators akin to Rubygem's ~> 1.2.3 which essentially means >= 1.2.3 and < 1.3. But this can be accomplished with the 4 building blocks gt, gte, lt and lte. So less pressing.
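The `~>` expansion described above can be sketched as a small helper. This is only an illustration in Python; `pessimistic_bounds` is a hypothetical name, and the output shape simply mirrors the gte/lt keys of a range query.

```python
def pessimistic_bounds(version: str) -> dict:
    """Expand a Rubygems-style '~> X.Y.Z' constraint into gte/lt bounds."""
    parts = [int(p) for p in version.split(".")]
    # Drop the last segment (unless there is only one), then bump the new last one.
    head = parts[:-1] if len(parts) > 1 else parts
    upper = head[:-1] + [head[-1] + 1]
    return {"gte": version, "lt": ".".join(str(p) for p in upper)}


print(pessimistic_bounds("1.2.3"))  # {'gte': '1.2.3', 'lt': '1.3'}
```

The bounds are plain strings here; comparing them correctly still needs a version-aware field or comparator.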

I don't currently have a clear use case for semver_range, but this could also prove useful.

@jpountz
Contributor

jpountz commented Dec 9, 2019

Thanks @webmat, this is very useful. I agree ranges are important. I think we could support versions with the following restrictions:

  • <numeric identifier>s for major, minor and patch can't be greater than 65535 (so that they can be stored on 2 bytes)
  • <pre-release> is limited to 13 characters (can be stored on 10 bytes since there are only 38 allowed characters)

The build number doesn't need to be part of the indexed representation since it doesn't matter for ordering. This would allow storing versions as a 16-byte integer, which is the maximum that Lucene supports. Do these restrictions feel ok?
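To make the 2-byte restriction concrete, here is a rough Python sketch that packs only the numeric core. The function name `encode_core` is made up for illustration, and the 10-byte pre-release part is deliberately left out; this is not necessarily the encoding that was eventually implemented.

```python
import struct

MAX_PART = 0xFFFF  # 65535, the largest value that fits in 2 unsigned bytes


def encode_core(major: int, minor: int, patch: int) -> bytes:
    """Pack major.minor.patch into 6 big-endian bytes."""
    for part in (major, minor, patch):
        if not 0 <= part <= MAX_PART:
            raise ValueError(f"version part out of range: {part}")
    # Big-endian unsigned shorts: byte-wise comparison matches numeric order.
    return struct.pack(">HHH", major, minor, patch)


assert encode_core(2, 0, 0) < encode_core(11, 0, 0)
```

Because the bytes are big-endian, comparing the encoded values byte by byte gives the same order as comparing the numbers, which is what a range query over the encoded form would rely on.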

@webmat

webmat commented Dec 11, 2019

I think the 2 bytes restriction per number and the overall 16 byte integer representation are reasonable, yes.

Looking at your comment before edits, I was about to say I didn't expect us to dig into or interpret the labels that much (the number in alpha.1). I would specifically avoid trying to parse labels, actually. There are conventions there, but people do weird things with these labels. You'll note that I'm not even calling them "pre-release", but specifically "label", which doesn't impose a semantic meaning. 2 examples to support this:

Whenever we reference this part of a version string, I would suggest we name it "label" or something that doesn't impose a semantic meaning.

To confirm, is it correct to say that we would parse and index up to the first 3 numbers, but not the 4th (e.g. in longer versions such as Chrome's "78.0.3904.108")?

I'm good with this, as long as the 4 number versions do get parsed successfully. If someone is looking for an exact build number, they should search for the exact match of the version string, not version.build:108

I'm also good with truncating the label at 13 bytes. Or were you suggesting 1.2.3-really-really-long-label would fail parsing?

@jpountz
Contributor

jpountz commented Dec 12, 2019

My worry is that the specification is very clear about how these labels should be used to compute precedence, and I worry that users will be caught by surprise if 1.0.0-alpha2 doesn't compare less than 1.0.0-rc1 or if 1.0.0-alpha.2 doesn't compare less than 1.0.0-alpha.10.

Regarding versions that have long labels, I believe we would be able to not reject them by storing the prefix in the index and the whole thing in doc values. Then range queries would use the index for the first 13 characters and fall back to doc values when version labels are longer, which should be rare.

How to handle non-semver version schemes is a good question. For instance, when there is no ambiguity, we could support additional schemes directly e.g. major.minor.patch.label (dot instead of dash) for Chromium versions. The Redhat version sounds more complicated to support to me since I guess we will want to support major.minor too, without patch and label information and this might introduce ambiguity. Or maybe this field should be more structured to give more flexibility, e.g. something like below instead of a single string. Note that everything would still get indexed in a single field.

"version": {
  "major": 5,
  "minor": 0,
  "patch": 0,
  "label": "el6"
}

@webmat

webmat commented Dec 12, 2019

Point no 9 of Semver states:

Pre-release versions have a lower precedence than the associated normal version.

Supporting that part makes sense and is straightforward. It states that all versions that include "a pre-release label" are lower precedence -- or earlier version -- than the same version without a pre-release label. 1.0.0-alpha < 1.0.0. I'm 👍 to support this, and I retract my insistence on calling this "labels"; calling this "prerelease" will give users a better mental model on how the sorting works.

But to me the spec seems to give up on specifying anything in the meaning of the acceptable labels, in terms of comparing one label to another. If you focus on the last two example labels they give:

Examples: 1.0.0-alpha, 1.0.0-alpha.1, 1.0.0-0.3.7, 1.0.0-x.7.z.92

These examples are actually all over the place. So I think working to support sensible sorting based on the label part of the version may end up being a lot of effort, for something that's not actually defined by SemVer 2.

I think a useful approach to handling the sorting of the labels would actually be to also "give up" and just do lexical sorting.

Pros:

  • simpler to implement
  • simple to understand
  • will "work" for all sorts of made-up label combinations or transitions between labeling strategies

Cons:

  • has the typical problems of lexical sorting where "beta11" < "beta2"

Glancing at version numbers of various package lists like my workstation's, homebrew's or a Linux distro's is a good reminder of how narrow a corner case sensible interpretation of the labels is. A significant percentage seem to not follow SemVer (maybe 10%?), the majority seems to follow a sensible major.minor.patch format, but few actually have labels; those that do rarely seem to follow a logic close to -[label][optional numeric].

I'd rather we spend our cycles on making version datatype support resilient to the craziness seen in the wild. Ideas:

  1. Try semver parsing with prerelease labels having lower precedence than non labeled version, then lexical sorting of the labels
  2. Fall back to parsing numbers separated by dots and ignoring what doesn't match ("1.2.3.4", "1.0beta4+r3088" => 1.0, "17.0.0.Final" => 17.0.0, "2005-02-20" => 2005.02.20)
  3. Fall back to lexical sorting of the even crazier version strings ("00-5.0.5", "r2917")
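Step 2 of the ideas above could look roughly like the following Python sketch. The name `numeric_prefix` is hypothetical, and treating "-" as a separator (so that "2005-02-20" parses) is an assumption of this sketch that would conflict with semver pre-release labels if it were applied before step 1.

```python
import re
from typing import List

# Leading run of numbers separated by dots or dashes, e.g. "17.0.0" in "17.0.0.Final".
_LEADING_NUMBERS = re.compile(r"\d+(?:[.-]\d+)*")


def numeric_prefix(version: str) -> List[int]:
    """Return the leading numeric parts, ignoring whatever doesn't match."""
    match = _LEADING_NUMBERS.match(version)
    if match is None:
        return []  # nothing numeric up front: fall through to lexical sorting
    return [int(part) for part in re.split(r"[.-]", match.group())]


print(numeric_prefix("1.0beta4+r3088"))  # [1, 0]
print(numeric_prefix("17.0.0.Final"))    # [17, 0, 0]
print(numeric_prefix("2005-02-20"))      # [2005, 2, 20]
print(numeric_prefix("r2917"))           # []
```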

I'm also ok if we don't implement a fall back like described above, as long as version parsing of an unsupported format doesn't fail indexing. That part is absolutely critical, IMO.

@jpountz
Contributor

jpountz commented Dec 12, 2019

But to me the spec seems to give up on specifying anything in the meaning of the acceptable labels, in terms of comparing one label to another.

I don't have the same reading. These sentences in particular are only about how versions with the same major, minor and patch should compare depending on their labels:

Precedence for two pre-release versions with the same major, minor, and patch version MUST be determined by comparing each dot separated identifier from left to right until a difference is found as follows: identifiers consisting of only digits are compared numerically and identifiers with letters or hyphens are compared lexically in ASCII sort order. Numeric identifiers always have lower precedence than non-numeric identifiers. A larger set of pre-release fields has a higher precedence than a smaller set, if all of the preceding identifiers are equal.

Let me try to think more about whether I can find a way to index versions that never fails yet still honors precedence rules of semver.
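For reference, the precedence rule quoted above can be written down as a short comparator. This is a minimal Python sketch that only covers the pre-release part of two versions with identical major, minor and patch; `compare_prerelease` is a hypothetical name, and ASCII-only input is assumed.

```python
from functools import cmp_to_key


def _cmp(a, b) -> int:
    return (a > b) - (a < b)


def compare_prerelease(left: str, right: str) -> int:
    """Return -1, 0 or 1 for two pre-release strings such as 'alpha.2' and 'alpha.10'."""
    left_ids, right_ids = left.split("."), right.split(".")
    for l, r in zip(left_ids, right_ids):
        if l.isdigit() and r.isdigit():
            result = _cmp(int(l), int(r))      # digits only: compare numerically
        elif l.isdigit() or r.isdigit():
            result = -1 if l.isdigit() else 1  # numeric sorts below non-numeric
        else:
            result = _cmp(l, r)                # otherwise: ASCII order
        if result:
            return result
    # All shared identifiers are equal: the longer set of fields wins.
    return _cmp(len(left_ids), len(right_ids))


labels = ["alpha.10", "alpha.2", "alpha"]
print(sorted(labels, key=cmp_to_key(compare_prerelease)))
# ['alpha', 'alpha.2', 'alpha.10']
```

Note that under this rule "beta11" still compares lower than "beta2", because identifiers that contain letters are compared as plain strings.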

@webmat

webmat commented Dec 12, 2019

You're right, I overlooked point no 11, which does a better job at defining precedence between labels of a same version.

I think my point still stands on how much of a corner case sorting semver labels is, vs their "correct" usage in the wild, however.

@jpountz
Contributor

jpountz commented Dec 13, 2019

I think my point still stands on how much of a corner case sorting semver labels is, vs their "correct" usage in the wild, however.

I agree with this statement, but I know some people can be very picky about it. I also like the fact that we could just direct people to the semver docs to explain how precedence works vs. having to define our own rules.

I played a bit with encoding, and it looks possible to accept arbitrary strings yet honor precedence rules for versions that use the semver scheme. Precedence would also work as expected for variants like XX.YY (like Ubuntu) or X.Y.Z.B (like Chromium). That said I think it'd be good to enforce some simple rules, e.g.

  • only printable ASCII characters
  • a reasonable maximum length (e.g. 32)

@webmat

webmat commented Dec 13, 2019

Yes, I agree with these restrictions.

From the POV of ECS, this datatype would likely be used as a multi-field appended to the various pre-existing .version keyword fields. So I don't mind if we end up being pretty strict on this proposed datatype. If users are dealing with a package/library/service that has a crazy version scheme, they can always use the keyword field instead of its nicer .version.semver multi-field.

And thinking more about the crazy version strings, I guess they would kind of work anyway. The "r2917" above is a real version seen on Homebrew. If this parses to 0.0.0 + pre-release label "r2917", I guess the sorting will work as expected, when sorted against "r2916" and "r2918" 🤷‍♂ 🙂

@jpountz
Contributor

jpountz commented Dec 13, 2019

@webmat What are the reasons why you would not only index the field as a version?

The encoding strategies would have sorted r2917 in ASCII order, so it would compare less than r3 for instance, but we could configure the encoding to sort numbers after an r in numeric order.

@webmat

webmat commented Dec 13, 2019

@webmat What are the reasons why you would not only index the field as a version?

Right now the fields are already defined as keyword in ECS. I assume the version datatype would not be compatible with keyword, in terms of querying over indices that have a mix of the two?

From ECS' point of view, this type incompatibility would be a breaking change. It's something we would only consider doing when ECS turns 2.0, which we would align with Elastic Stack 8.0.

@jpountz
Contributor

jpountz commented Dec 17, 2019

I assume the version datatype would not be compatible with keyword, in terms of querying over indices that have a mix of the two?

This is something we could make work.

@jpountz
Contributor

jpountz commented Dec 17, 2019

To be clear, search would work out of the box, only aggregations would require some work to make sure you can do e.g. a terms aggregation across keyword and version fields.

@webmat

webmat commented Jan 21, 2020

Would there be a way to query in general for versions that have a pre-release label?

E.g. if I look into my infrastructure's package and software versions, I'd like to be able to query for any version number that includes a label, no matter what the label is.

@jpountz
Contributor

jpountz commented Jan 21, 2020

It wouldn't come for free. In my opinion, we should either require this to be done on top (possibly with an ingest processor) with a separate field, e.g. { "version": "6.0.0-alpha1", "pre_release" : "alpha1" } and run an exists query, or { "version": "6.0.0-alpha1", "has_pre_release": true } and run a term query.

Or we could do it under the hood by indexing hidden fields like major, minor, patch and pre-release and have a special query for versions that would allow to query these fields independently.
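The "on top" variant could be approximated on the client side before indexing. The Python sketch below is only an illustration: `split_pre_release` is a hypothetical helper, not an existing ingest processor, and it assumes semver-style strings where "+" starts the build metadata and the first "-" starts the pre-release label.

```python
def split_pre_release(version: str) -> dict:
    """Build a document with the extra pre-release fields suggested above."""
    without_build = version.split("+", 1)[0]  # build metadata is not a label
    _, dash, label = without_build.partition("-")
    doc = {"version": version, "has_pre_release": bool(dash)}
    if label:
        doc["pre_release"] = label
    return doc


print(split_pre_release("6.0.0-alpha1"))
# {'version': '6.0.0-alpha1', 'has_pre_release': True, 'pre_release': 'alpha1'}
```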

@XavierRamosORGADATA

We have 2 use cases that could benefit from the above-mentioned fields: major, minor, patch and build.
We use 4 sequential dot-separated numerical fields for our software (e.g. 9.2.23.145) and we also use Windows version information (also 4 numerical fields), e.g. 10.0.383.552

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@afharo
Member

afharo commented May 26, 2020

Hi! Just popping over to add a use case. In our telemetry team, we usually run reports based on versions, e.g. I want to know X ratio for all 7.6+ clusters.

We currently use an analyzer like:

{
    "analysis": {
      "normalizer": {
        "major": {
          "char_filter": ["major"]
        },
        "minor": {
          "char_filter": ["minor"]
        },
        "patch": {
          "char_filter": ["patch"]
        }
      },
      "char_filter": {
        "major": {
          "pattern": "^(\\d+)(.*)",
          "type": "pattern_replace",
          "replacement": "$1"
        },
        "minor": {
          "pattern": "^(\\d+\\.\\d+)(.*)",
          "type": "pattern_replace",
          "replacement": "$1"
        },
        "patch": {
          "pattern": "^(\\d+\\.\\d+\\.\\d+)(.*)",
          "type": "pattern_replace",
          "replacement": "$1"
        }
      }
    }
}

And our version field has the following mappings:

      {
        "type": "keyword",
        "fields": {
          "major": {
            "type": "keyword",
            "normalizer": "major"
          },
          "minor": {
            "type": "keyword",
            "normalizer": "minor"
          },
          "patch": {
            "type": "keyword",
            "normalizer": "patch"
          }
        }
      }

As you might expect, the field is ultimately analysed as a keyword, so a query like the one below (I want to bring back anything with version.minor >= 7.6) will fail to return versions like 7.10.1 (if we ever reach that version in the stack).

{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "version.minor": {
              "gte": "7.6"
            }
          }
        }
      ]
    }
  }
}

On top of that, this setup doesn't add any semver validation, so I can store any value in it.
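The range problem with keyword-style comparison can be reproduced in a few lines of Python. The helper `as_tuple` is hypothetical and assumes purely numeric, dot-separated input.

```python
def as_tuple(version: str) -> tuple:
    """Turn '7.10' into (7, 10) so parts compare as numbers."""
    return tuple(int(part) for part in version.split("."))


# Keyword-style (lexicographic) comparison: '1' sorts before '6', so 7.10 looks older.
print("7.10" >= "7.6")                      # False
# Numeric comparison per part gives the expected answer.
print(as_tuple("7.10") >= as_tuple("7.6"))  # True
```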

@cbuescher
Member

I started looking into possible encoding schemes for this and have a POC that would allow using a wider range of version schemes, including but not limited to the Semantic Versioning scheme. My current assumptions are:

  • version strings consist of numbers and printable ASCII characters (probably only alphanumeric) plus a few characters with “special” meaning, e.g. ‘.’ to separate different version parts, “-” to separate a “pre-release” or other label-like part and “+…” for an optional “buildNumber” section that doesn’t have defined precedence rules. So basically, borrowing from the BNF in Semantic Versioning 2.0.0, we’d have:

      <version core> (-<pre-release>) (+<build>)
    

Where only <version core> would be mandatory

In extension to the strict SemVer specs and its precedence rules, we could easily further support:

  • variable length numeric “version core” parts, e.g. the previously mentioned Chrome four-part scheme (“78.0.3904.108”)
  • different precedence rules for the pre-release part. SemVer e.g sorts mixed alphanumeric identifiers like “alpha11” and “alpha2” alphabetically, so “alpha11” < “alpha2”. We could allow a configuration option on the field to make this consider numeric blocks and sort them numerically instead, so “alpha2” < “alpha11”. This would require configuration by the user of the field however and affect the whole field.

We can encode the optional “build” part into the same field to allow exact matching on it if we simply say that we don’t ensure any specific ordering for that part. The SemVer specs say that “Build metadata must be ignored when determining version precedence. Thus two versions that differ only in the build metadata, have the same precedence.”, but I’d say from a practical point of view it’s enough if we don’t guarantee any precedence here. When e.g. sorting values one has to decide on some sort of ordering anyway.

The options sketched out above would still not allow for versions like the RedHat “5.el6” mentioned earlier on this issue. It would be possible to also allow alphanumeric ids in the <version core> part, but I wonder how frequent these cases would be. To keep the number of options low, I wonder whether it wouldn’t be better to solve such cases on ingestion / preprocessing by converting them to something like “5-el6”, which we could handle with the POC encoding.

The POC allows for exact searching, ranges (like "gte" : 1.1.0 or things like "gte" : 2.99.99, "lte" : 3.0.0 to e.g. search only 3.0.0 pre-release version). It would also allow for a version_range field with behaviour very similar to what we e.g. have on other range fields today already.

@axw
Member

axw commented Jun 23, 2020

different precedence rules for the pre-release part. SemVer e.g sorts mixed alphanumeric identifiers like “alpha11” and “alpha2” alphabetically, so “alpha11” < “alpha2”. We could allow a configuration option on the field to make this consider numeric blocks and sort them numerically instead, so “alpha2” < “alpha11”. This would require configuration by the user of the field however and affect the whole field.

This reminded me of natsort. Just thought I'd share, in case it's unknown. Might be useful for inspiration.
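The numeric-block ordering discussed here can be sketched with the Python standard library alone. `natural_key` is a hypothetical helper for illustration, not the natsort API, and placing numeric blocks before alphabetic ones at the same position is an assumption of this sketch.

```python
import re


def natural_key(label: str) -> list:
    """Split a label into digit and non-digit blocks; digits compare as numbers."""
    blocks = re.findall(r"\d+|\D+", label)
    # Tag each block so ints and strings are never compared with each other.
    return [(0, int(b)) if b.isdigit() else (1, b) for b in blocks]


labels = ["alpha11", "alpha2", "alpha10"]
print(sorted(labels))                   # ['alpha10', 'alpha11', 'alpha2']
print(sorted(labels, key=natural_key))  # ['alpha2', 'alpha10', 'alpha11']
```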

@cbuescher cbuescher self-assigned this Jul 8, 2020
@alisonelizabeth alisonelizabeth added the Team:Deployment Management Meta label for Management Experience - Deployment Management team label Jul 8, 2020
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Jul 20, 2020
This PR adds a new 'version' field type that allows indexing string values
representing software versions similar to the ones defined in the Semantic
Versioning definition (semver.org). The field behaves very similar to a
'keyword' field but allows efficient sorting and range queries that take into
account the special ordering needed for version strings. For example, the main
version parts are sorted numerically (i.e. 2.0.0 < 11.0.0) whereas this wouldn't
be possible with 'keyword' fields today.

Valid version values are similar to the Semantic Versioning definition, with the
notable exception that in addition to the "main" version consisting of
major.minor.patch, we allow fewer or more than three numeric identifiers, i.e.
"1.2" or "1.4.6.123.12" are treated as valid too.

Relates to elastic#48878
cbuescher pushed a commit that referenced this issue Sep 21, 2020
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Sep 21, 2020
cbuescher pushed a commit that referenced this issue Sep 21, 2020
@cbuescher
Member

The PR adding the main "version" field type was merged with #62692.
There is a follow-up PR (currently only a draft) addressing the ask for a version range field, similar to ip or date ranges. So I'm keeping this open for now.

@fredeil

fredeil commented Oct 5, 2020

Can't wait for this feature @cbuescher. When do you think it will be generally available? I guess the feature has to be added to the different SDKs also

@cbuescher
Member

When do you think it will be generally available?

The main new field type has been merged to the 7.x branch, which should go out with the upcoming 7.10 release if there are no last-minute changes. We don't give estimates of release dates etc. here, but the minor version release cadence has been pretty stable, so you can guesstimate.

I guess the feature has to be added to the different SDKs also

I'm not sure I fully understand this. What SDKs do you mean?

@geekpete
Member Author

geekpete commented Oct 5, 2020

SDKs as in support in language clients?

@fredeil

fredeil commented Oct 6, 2020

@geekpete @cbuescher yeah, the NEST C# client for example.

@cbuescher
Member

@fredeil Thanks for the clarification. The "version" field type should be usable like any other specialized field type via the REST interface. If the client API has special methods for creating fields in index mappings for every possible field type, it might need updating. I only know the Java high level client well enough to answer this, but there you provide a map of field definitions that's not strongly typed, so no updates are needed there. My guess would be that it's similar for the C# client.

@ebeahan
Member

ebeahan commented Oct 28, 2020

I wanted to circle back on an earlier point in the discussion: #48878 (comment)

The exchange suggested that version may be compatible with keyword, similar to what was done with the recently introduced wildcard field type, which was included as part of the keyword type family. We had hoped to adopt version into ECS as a non-breaking change, similar to a migration taking place for fields transitioning types from keyword to wildcard.

After experimenting with the version type a bit, it appears to not be part of the keyword family. Would version ever be considered for the keyword family, or do underlying differences exist that prevent keyword compatibility?

With either path, ECS is still looking forward to adopting version 😄 . If the types will always be incompatible, it may require us to plan on introducing this later on as a breaking change instead.

@cbuescher
Member

Keeping this issue open because I'd still like to add a version_range field similar to other range fields, something that was initially requested and I think it still would be a good addition to the version field. The implementation sketched out in this WIP seems doable, I think it would need a bit of polishing still but I'd like to pick this up again in the near future. Keeping this here as a note.

@javanna
Member

javanna commented Nov 16, 2022

The version field has been added; the version range field is tracked separately in #83995. Closing this issue.

@javanna javanna closed this as completed Nov 16, 2022
@javanna javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024