Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve filtering configuration with globs and make match_type optional #1081

Closed
jrcamp opened this issue Jun 4, 2020 · 25 comments
Closed
Labels
enhancement New feature or request Stale
Milestone

Comments

@jrcamp
Copy link
Contributor

jrcamp commented Jun 4, 2020

This was originally brought up in #560 (comment) but was out of scope for the metric filtering PR.

Background

Today the type of matching pattern to use is done by explicitly setting match_type. For example:

        # metrics is the configuration for filtering metrics
        metrics:
            # include and/or exclude can be specified. However, the include properties
            # are always checked before the exclude properties.
            exclude:
                # match_type controls how items in "metric_names" arrays are
                # interpreted. Possible values are "regexp" or "strict".
                # This is a required field.
                match_type: regex

                # The metric name must match at least one of the items.
                # This is an optional field.
                meric_names: [metric1, metric2]

This has some downsides in that you must have a different include/exclude section for each match_type when the metric names you are matching against may logically fit into one section. Additionally the current model does not support globs which are in many cases sufficient when regex are overkill.

Proposal

The proposal is to specify match_type as part of the match pattern. It uses preexisting conventions that are likely familiar to most users. Additionally I'm proposing adding negating support to make cases easy where users want to match a wide pattern (as it would be too many to enumerate manually) but then negate matches later.

metrics:
    # include and/or exclude can be specified. However, the include properties
    # are always checked before the exclude properties.
    exclude:
        # The metric name must match at least one of the items.
        # This is an optional field.
        meric_names: [
            "host/cpu/*",       # glob matching
            "host/cpu/usage",   # exact matching
            "/cpu/",            # regex matching
            "!host/cpu/total"   # inverse match so that host/cpu/total isn't dropped
        ]

Globs are used extensively across many platforms, databases, shells, etc. The regex syntax of /cpu/ is in many languages like JavaScript, Perl, sed, etc. I'm not sure how extensive ! is used but it is used by gitignore for pattern matching.

Backwards Compatability

To keep existing configs backwards-compatible I'm proposing making match_type optional (it is currently require). If a user specifies a match_type value the current behavior will be maintained. If match_type is not set then the new syntax described here will be used.

@bogdandrutu
Copy link
Member

@jrcamp:

  1. Not sure I understand how the "/cpu/" is different from "host/cpu/usage". Can you explain how can you know when to apply fix match versus regexp match in this case?
  2. "!host/cpu/total" - this is just for convenience because you can still achieve this with a regexp, or am I missing something?

I was doing some refactoring around the filterset and found the current config hard to use as well. I do believe that it is an important optimization to know when a pattern is regexp or strict string comparison.

So my proposed alternative is:

metrics:
    # include and/or exclude can be specified. However, the include properties
    # are always checked before the exclude properties.
    exclude:
        # The metric name must match at least one of the items.
        # This is an optional field.
        meric_names:
            regexp:
                # other properties like cache
                filter_values: [
                    "host/cpu/*",       # glob matching
                ]
            strict:
                filter_values: [
                    "host/cpu/usage",   # exact matching
                ]

@jrcamp
Copy link
Contributor Author

jrcamp commented Jun 4, 2020

Not sure I understand how the "/cpu/" is different from "host/cpu/usage". Can you explain how can you know when to apply fix match versus regexp match in this case?

Only if it starts and ends with / would it be regex. host/cpu/usage doesn't start and end with / so it's just an exact match.

!host/cpu/total" - this is just for convenience because you can still achieve this with a regexp, or am I missing something?

How do you match host/cpu/.* with regex but not host/cpu/total. Is it possible without negative lookahead (which I don't think Go supports?)

You can see more examples here that we use in the SignalFx agent: https://docs.signalfx.com/en/latest/integrations/agent/filtering.html

I do believe that it is an important optimization to know when a pattern is regexp or strict string comparison.

You do know based on the syntax defined above of /cpu/ for regex and anything else is a glob. (Exact matching is a subset of glob matching) because host/cpu/usage evaluated like a glob is the same thing as doing exact matching, though maybe you can make it more efficient by bypassing glob evaluation.

@keitwb
Copy link
Contributor

keitwb commented Jun 4, 2020

In the Smart Agent, we settled on the way Jay is proposing because it was easier to write and understand from the end user perspective, with less syntax and structure to remember. Performance-wise it can be written in such a way that it is no different.

I'm not sure how necessary the negation is if you are going to keep the exclude/include sections. Is there any functionality it provides that cannot already be done with the existing config or is it more of a convenience? It isn't clear what it means when it says "the include properties are always checked before the exclude properties". Does that mean that if something matches in includes then it doesn't even both checking excludes?

@jrcamp
Copy link
Contributor Author

jrcamp commented Jun 4, 2020

I'm not sure how necessary the negation is if you are going to keep the exclude/include sections.

Yeah that's true, if you always have include and exclude sections it's probably not useful. I was thinking I saw some kind of filtering in OT that was just a single list without separate sections where I thought it might be useful but I forget where.

Can always be added later if needed. I think we can drop it from the proposal for now.

@bogdandrutu
Copy link
Member

Is any standard that defines /text/ to be a regexp vs strict check?

@jrcamp
Copy link
Contributor Author

jrcamp commented Jun 4, 2020

Is any standard that defines /text/ to be a regexp vs strict check?

You mean what if you have a filter that's /text/ that you want to match as an exact? Some possibilities:

  • You have to use the regex /^/text/$/ It's suboptimal but it's not a common case.
  • Could fall back to setting match_type but then we couldn't deprecate it.
  • Maybe could have some way to escape it but I couldn't think of anything very clean.

@keitwb best I can tell you'd just have to use the regex in SA, am I missing anything?

@bogdandrutu
Copy link
Member

But for your initial example why did you use /cpu/ instead of .*cpu.*?

@jrcamp
Copy link
Contributor Author

jrcamp commented Jun 4, 2020

For demonstration purposes only :) Yes a better example would probably have been /.*cpu.*/

The vast majority of the times globs are sufficient and easier though: *cpu*

@bogdandrutu
Copy link
Member

@jrcamp I am confused:

            "host/cpu/*",       # glob matching
            "host/cpu/usage",   # exact matching
            "/cpu/",            # regex matching

Then why do you need the first and third to be "different" I mean the are both glob matching.

@james-bebbington
Copy link
Member

james-bebbington commented Jun 5, 2020

Is any standard that defines /text/ to be a regexp vs strict check?

I don't know if there's an explicit standard that defines regex's to be enclosed in slashes, but its a very common convention (as Jay mentioned, this is the case for JavaScript, Perl, sed, etc) - see: https://en.wikipedia.org/wiki/Regular_expression#Delimiters

why do you need the first and third to be "different" I mean the are both glob matching.

Glob is different from regex: https://en.wikipedia.org/wiki/Glob_(programming)#Compared_to_regular_expressions


I quite like Jay's proposal for the ease of configuration, but a couple of questions/concerns:

  1. What if a user wanted to do a strict match on Foo*? I guess they could use /^Foo\*$/, but it makes a simple case quite complex.

  2. It's not immediately obvious how negation matches would be derived, i.e. consider:

include:
  # matches "foo" OR "foobar"
  names: [ "foo", "foobar" ]

include:
  # matches "foo*" OR "!foobar" ==> matches "bar"
  names: [ "foo*", "!foobar" ]

In the code we could treat negation differently by taking the union of all regular entries, and then subtract the negation entries from that, but I feel that is more confusing than just keeping the include/exclude filters.

  1. Where do the existing cacheenabled & cachemaxentries properties fit into this proposal if anywhere?

@jrcamp
Copy link
Contributor Author

jrcamp commented Jun 5, 2020

why do you need the first and third to be "different" I mean the are both glob matching.

Glob is different from regex: https://en.wikipedia.org/wiki/Glob_(programming)#Compared_to_regular_expressions

👍

  1. What if a user wanted to do a strict match on Foo*? I guess they could use /^Foo\*$/, but it makes a simple case quite complex.

Either regex or we allow * to be escaped (this means the escape has to be escaped as well). e.g.:

Foo\* for matching text Foo*
Foo\\ for matching text Foo\

  1. It's not immediately obvious how negation matches would be derived, i.e. consider:

Ben made a good point, there's really no need for negation when you have the include/exclude concept. However, negation is an alternative to having include/exclude if we wanted. See our docs in case you haven't yet for more details and examples on the negation. However I'm ok keeping include/exclude as it makes backwards compat easier.

  1. Where do the existing cacheenabled & cachemaxentries properties fit into this proposal if anywhere?

I think they would still fit under the regex key as optional, same as they do today.

@jrcamp
Copy link
Contributor Author

jrcamp commented Jun 5, 2020

Either regex or we allow * to be escaped (this means the escape has to be escaped as well). e.g.:

Foo\* for matching text Foo*
Foo\\ for matching text Foo\

Actually, using \ for escaping may not be very nice due to yaml trying to interpret the backslash as a regular string escape.

@bogdandrutu
Copy link
Member

What I am hearing is that there are corner cases when it comes to unify the filter values in order to determine a strict match vs regexp match, so probably having them separated is not that bad (also for regexp we do have the caching option) should we keep them separated? Also when it comes to the standard that we use for regexp probably following the golang is the easiest for us.

@jrcamp
Copy link
Contributor Author

jrcamp commented Jun 5, 2020

There are corner cases but do we choose to optimize for them or for the 99% case? I believe we should optimize for the 99% cases.

@bogdandrutu
Copy link
Member

I am 100% for user first, but having cpu* being able to be interpreted as regexp or as static comparison is not user friendly, and us inventing a new standard for this is also not "user friendly".

So because of that explicitly ask the user what to use is in my opinion more user friendly than having undefined behaviors or having our custom standard.

@jrcamp
Copy link
Contributor Author

jrcamp commented Jun 5, 2020

Talked to Bogdan offline and the concern is having corner cases with values like * and ? may not be an issue for metric names but could be more of an issue when filtering other kinds of values since the filtering mechanism is generic and not just for metrics. I'm unable to think of an easy way to do escaping in those cases either given the constraints of YAML. (Please let me know if there are ways.)

My proposal building on top of @bogdandrutu's suggestion. I'm proposing separate regex config into its own regexp_config section because for strict and glob they do not need any settings (seems unlikely they would ever need any) so this simplifies things a bit not having to have a filter_values for the most common cases.

metrics:
    # include and/or exclude can be specified. However, the include properties
    # are always checked before the exclude properties.
    exclude:
    # The metric name must match at least one of the items.
    # This is an optional field.
    meric_names:
        regexp_config:
            cache_enabled: true
        
        regexp:
        - host/cpu/.*

        strict:
        - host/cpu/usage

        glob:
        - host/*

Backwards compatibility would be done by checking whether match_type is set or not. If not set it would use the new configuration.

@anniefu
Copy link
Contributor

anniefu commented Jun 5, 2020

metrics:
    # include and/or exclude can be specified. However, the include properties
    # are always checked before the exclude properties.
    exclude:
    # The metric name must match at least one of the items.
    # This is an optional field.
    meric_names:
        regexp_config:
            cache_enabled: true
        
        regexp:
        - host/cpu/.*

        strict:
        - host/cpu/usage

        glob:
        - host/*

The UX of setting the match type explicitly here is nice. I agree with @bogdandrutu that a new standard, even if it is more intuitive, may not necessarily be more friendly to users.

I do want to raise the concern here that allowing multiple filter types in the same filter will increase processing on the backend. We'd have to run all the incoming metrics/traces through each specified filter type. It'd be more performant for the user to write their strict matches in regexp if they also need regexp power.

@jrcamp
Copy link
Contributor Author

jrcamp commented Jun 9, 2020

I do want to raise the concern here that allowing multiple filter types in the same filter will increase processing on the backend. We'd have to run all the incoming metrics/traces through each specified filter type. It'd be more performant for the user to write their strict matches in regexp if they also need regexp power.

I'm not I understand the performance concern. At startup time we "parse" the config and collect a set of rules that are to be evaluated whether they be strict, glob, or regex. I'm not sure what the extra overhead is when processing metrics?

@james-bebbington
Copy link
Member

james-bebbington commented Jun 9, 2020

The new proposal looks good to me.

A couple of minor thoughts:

  1. regexp_config could probably be inlined (squashed)

  2. Does it make sense to have the include filter or field name at the top level?

i.e. consider

include:
  metric_names:

vs

metric_names:
  include:

@anniefu
Copy link
Contributor

anniefu commented Jun 13, 2020

I'm not I understand the performance concern. At startup time we "parse" the config and collect a set of rules that are to be evaluated whether they be strict, glob, or regex. I'm not sure what the extra overhead is when processing metrics?

Sorry, I should have clarified, I was mostly thinking about the case with caching matches enabled.

For example, to support a mixed regexp & strict & glob filterprocessor, I could see some processing logic like:

if regexpFilters.Match(metric) ...
else if strict.Match(metric)...
else if glob.Match(metric)...

The regexp filterset has an option to use a cache to prevent re-running through all the regexp filters right now, but then you have the additional overhead of running through strict and glob-type filters again (though they could also cache separately).

In this case, it's probably best to have the caching option on the top-level of the filter config instead of just regexp_config, so you could do something like this:

if prevResult, ok := cache[metricName] {
return prevResult
}
if regexpFilters.Match(metricName) ...
else if strict.Match(metricName)...
else if glob.Match(metricName)...

@jrcamp jrcamp added this to the Backlog milestone Jul 30, 2020
@james-bebbington
Copy link
Member

What is the status of this? Do we still want to make changes to the filter config or are we okay with how it currently is?

Is it possible to make these packages not internal so that they can be used in Contrib?

@jrcamp @bogdandrutu

@james-bebbington
Copy link
Member

james-bebbington commented Oct 13, 2020

Another Proposal:

I looked at what a few other systems do. I didn't find anything super useful to be gleaned as most systems use filter configuration that is specific to the task at hand, and it seems relatively rare to offer strict, glob, and regex options.

Anyway. here is an alternative to what Jay suggested where we keep the "type" property and change how include/exclude works to reduce the amount of nesting needed.

Goal 1: Streamline the Config

At the moment, the config can be a little verbose. Some simplifications we can make:

  • Remove a level of nesting by making include/exclude leaf level instead of root level
  • Remove a level of nesting by squashing regex options (which will be relevant for glob as well)
  • Make "strict" the default so "type" can be omitted for simple cases

Proposal:

include: <string or array>
exclude: <string or array>
type: <strict | glob | regexp> # strict by default
cache_enabled: <bool> # optional
cache_maxentries: <int> # only applies if cache_enabled = true

Two limitations of this approach:

  • You can't have a different "type" for include & exclude filters, but that doesn't limit the functionality as anything that could be expressed via strict or glob can be expressed via regexp if needed. This optimizes for the 99% case.
  • You can't give a custom name to the filter field. I don't think this matters as if a user wants to filter on two different fields, they can just introduce a level of nesting (see the second example below)

Simple case:

Current:

filter:
  metrics:
    include:
      metric_names: "hello world"
      match_type: strict

With Proposal:

filter:
  metrics:
    include: "hello world"

Multiple filters:

Current:

hostmetrics:
  scrapers:
    filesystem:
      include_devices:
        devices: [ "^[A-Z]:$" ]
        match_type: regexp
        regexp_config:
          cache_enabled: true
      exclude_devices:
        devices: [ "^D:$", "^E:$" ]
        match_type: regexp
        regexp_config:
          cache_enabled: true
      include_fs_types:
        fs_types: "Ext2"
        match_type: strict

With Proposal:

hostmetrics:
  scrapers:
    filesystem:
      devices:
        include: [ "^[A-Z]:$" ]
        exclude: [ "^D:$", "^E:$" ]
        match_type: regexp
        cache_enabled: true
      fs_types:
        include: "Ext2"

Goal 2: Simplify the Code to use the filter

The current code requires quite a bit of additional scaffolding to use. The suggested new config should simplify this significantly:

  • Embeds include & exclude within one filter object, reducing the number of filtersets that need to be used by half
  • Removes the need to create a Filter config struct to set the name of the field being filtered on

Current:

type Config struct {
    MyFieldIncludeFilter MyFieldMatchConfig `mapstructure:",include_my_field"`
    MyFieldExcludeFilter MyFieldMatchConfig `mapstructure:",exclude_my_field"`
}

type MyFieldMatchConfig struct {
    filterset.Config `mapstructure:",squash"`
    MyFields []string `mapstructure:"my_fields"`
}

...

# initialization code
includeFilter, err := filterset.CreateFilterSet(items, cfg.MyFieldIncludeFilter)
if err != nil {
    return nil, err
}

excludeFilter, err := filterset.CreateFilterSet(items, cfg.MyFieldExcludeFilter)
if err != nil {
    return nil, err
}

With Proposal:

type Config struct {
    MyFieldFilter MatchConfig `mapstructure:",my_field"`
}

...

# initialization code
filter, err := filterset.CreateFilterSet(items, cfg.MyFieldFilter)
if err != nil {
    return nil, err
}

The filtermetric & filterspan packages will become almost redundant, save for one helper function.

Goal 3: Expose the filters for use externally

Self explanatory. Blocks open-telemetry/opentelemetry-collector-contrib#1088 (comment)

@jrcamp
Copy link
Contributor Author

jrcamp commented Oct 13, 2020

I think it's an improvement and I think the downsides are fine. Will we be able to keep backwards compatibility in a reasonable way or would it require breaking configs?

@james-bebbington
Copy link
Member

It would require breaking configs 😦

But I don't think that's avoidable if we want to simplify the config

@amanbrar1999
Copy link
Contributor

amanbrar1999 commented Oct 20, 2020

@jrcamp @james-bebbington can Goal 3 of this proposal be done independently of the others in order to unblock issues open-telemetry/opentelemetry-collector-contrib#1088 and #1892? or do Goals 1 and 2 stand as prerequisites?

edit: This was answered in #1892

@andrewhsu andrewhsu added enhancement New feature or request and removed feature request labels Jan 6, 2021
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
…lemetry#1081)

* Bump github.com/itchyny/gojq from 0.11.0 to 0.11.1 in /tools

Bumps [github.com/itchyny/gojq](https:/itchyny/gojq) from 0.11.0 to 0.11.1.
- [Release notes](https:/itchyny/gojq/releases)
- [Changelog](https:/itchyny/gojq/blob/master/CHANGELOG.md)
- [Commits](itchyny/gojq@v0.11.0...v0.11.1)

Signed-off-by: dependabot[bot] <[email protected]>

* Auto-fix go.sum changes in dependent modules

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tyler Yahn <[email protected]>
@github-actions github-actions bot added the Stale label Jan 11, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 13, 2023
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this issue Apr 27, 2023
…etry#1081)

Bumps [requests](https:/psf/requests) from 2.27.0 to 2.27.1.
- [Release notes](https:/psf/requests/releases)
- [Changelog](https:/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.27.0...v2.27.1)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Troels51 pushed a commit to Troels51/opentelemetry-collector that referenced this issue Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Stale
Projects
None yet
Development

No branches or pull requests

7 participants