
Discuss strategies to reduce number of fields #2839

Closed
jsoriano opened this issue Mar 17, 2022 · 29 comments
Labels
discuss, Stalled, Team:Ecosystem (Label for the Packages Ecosystem team [elastic/ecosystem])

Comments

@jsoriano
Member

There are some packages in this repository with too many fields on their data streams. Having too many fields can lead to performance degradation and is usually discouraged (see related ES docs).

In elastic/package-spec#278 we introduced a limit of 1024 fields per data stream, but we had to increase it to 2048 in elastic/package-spec#294 to keep builds green for some packages. We would like to be able to set a lower limit, at least by default.
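(For context, Elasticsearch also enforces its own per-index cap through the index.mapping.total_fields.limit setting, which defaults to 1000 and is separate from this build-time package-spec check. A minimal sketch of raising it on a concrete data stream, only to illustrate what that limit controls; the data stream name is a placeholder:)

# Sketch only: "logs-example-default" is a placeholder data stream name.
PUT logs-example-default/_settings
{
  "index.mapping.total_fields.limit": 2048
}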

Some questions to discuss:

  • Do all of these fields need to be indexed?
  • Could some of them be replaced by runtime fields? Is there something missing in the tooling or Fleet to support this?
  • Could some of them be replaced by flattened objects? Is there something missing to support this?
  • Is this number of fields expected? Should we add a mechanism to optionally increase the limit per data stream to support these cases?
  • Other ideas to address this?

The packages with more than 1024 fields are:

  • Netflow, cc @elastic/security-external-integrations
  • Osquery Manager, cc @elastic/security-asset-management

cc @ruflin @mtojek

@jsoriano jsoriano added the discuss and Team:Ecosystem (Label for the Packages Ecosystem team [elastic/ecosystem]) labels Mar 17, 2022
@mtojek
Contributor

mtojek commented Mar 17, 2022

Let's ping folks behind these packages:

Netflow @andrewkroh @marc-gr
Osquery Manager @melissaburpo @aleksmaus

@andrewkroh
Member

andrewkroh commented Mar 17, 2022

It would be helpful to break down the contents of the netflow integration's fields, so I've included a table for reference (I also looked at all of the data streams: https://gist.github.com/andrewkroh/885e28b1cdafbeacf0fca20b062a6de2).

There are 1322 netflow.* fields and 425 other fields. The netflow fields come from IANA specifications (~500) and various vendor extensions. Each field has a specified data type. netflow.* alone puts us over the 1024 limit.

Looking at this list, I suspect that a small number of the included ECS fields are unused. For example, as.* is normally not used at the root of an event per ECS.

I'm open to ideas, but I don't see a good way to drastically reduce the field count for netflow.

Count Namespace
2 @timestamp
5 agent
2 as
30 client
16 cloud
10 container
3 data_stream
31 destination
18 dns
1 ecs
5 error
22 event
22 file
2 flow
8 geo
3 group
4 hash
38 host
10 http
1 input
1 labels
11 log
1 message
1322 netflow
11 network
23 observer
2 organization
6 os
10 package
16 process
1 related
30 server
7 service
31 source
1 tags
7 threat
1 trace
1 transaction
13 url
9 user
10 user_agent

@jsoriano
Member Author

@andrewkroh thanks for the analysis. I agree that it can make sense to have so many fields in some cases, but I wonder if we could do something to avoid them being indexed.

How are these netflow.* fields used? Could it be an option to convert the netflow object to the flattened type? Fields would lose their data types (they would all be indexed as keywords), but maybe this could be addressed with runtime fields if needed. We would need to add support for runtime fields though.
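(A minimal sketch of what that could look like at the mapping level, assuming a hypothetical component template; all leaf values under netflow would then be indexed as keyword-like values:)

# Sketch only: the component template name is hypothetical.
PUT _component_template/netflow_flattened_sketch
{
  "template": {
    "mappings": {
      "properties": {
        "netflow": { "type": "flattened" }
      }
    }
  }
}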

In any case, if we think that a limit of 2048 fields is not so bad, we can probably go on with this.

I would be cautious about adding ways to circumvent this limit, because it could produce packages with way too many fields once we open development to more teams.

@ruflin
Member

ruflin commented Mar 18, 2022

Do we need to index all these fields? My assumption is that the majority of them will not show up in most events. Would it be possible to only index a small portion and, for the rest, either use runtime fields in the mappings or at query time? Which of these fields do we use for the dashboards? https://github.com/elastic/package-storage/tree/production/packages/netflow/1.4.0/kibana/dashboard

Having the majority as runtime fields would still keep the template itself large, but I assume it would be more efficient storage-wise. @jpountz Is there a limit / recommendation on the max number of runtime fields someone should use for a single data stream / mapping?

If we go down that route, we could have two limits: one for indexed fields and one for non-indexed fields.
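(A minimal sketch of a runtime field declared in the mappings, to illustrate the "majority as runtime fields" idea; the template name is hypothetical and the field is just one example, since without a script the value is read from _source at query time:)

# Sketch only: template name and field choice are illustrative.
PUT _component_template/netflow_runtime_sketch
{
  "template": {
    "mappings": {
      "runtime": {
        "netflow.flow_duration_milliseconds": { "type": "long" }
      }
    }
  }
}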

@aleksmaus
Member

Osquery Manager @melissaburpo @aleksmaus

It's not clear how we can trim the number of mapped fields for osquery.
Osquery has 279 tables with dozens of columns that can be queried with any possible query the user can come up with.
We don't know in advance what fields users will be searching for.
We are open to suggestions on what other options are available, hopefully without degrading the already supported functionality.

@ruflin
Member

ruflin commented Mar 18, 2022

When you mention 279 tables, it sounds like there already exists some grouping. What kind of queries are run? Is there normally one query per table? If so, should we have one data stream per table (that seems extreme too)? What about the data retention, is it all the same across all fields / data structures? On ingestion, is one event always going into a single "table", meaning there are 279 different event types?

What about runtime fields, is this an option?

@aleksmaus
Member

We didn't put any restrictions; it can be any kind of query, including subqueries and joins of any complexity.
Example:
https://fleetdm.com/queries/detect-active-processes-with-log-4-j-running
There is no reliable way to detect which column belongs to which table in the result.
The mapping is the flat union of all the columns across the tables, to the lowest common/compatible datatype.

What would be the runtime fields experience in Kibana? Can users filter/sort/dissect the data received from the live queries right away with the existing Kibana UX?

@ruflin
Member

ruflin commented Mar 18, 2022

I had a conversation with @aleksmaus around how exactly the results data is queried in Elasticsearch itself. Results are grouped by query id, and normally there are a few hundred to at most a few thousand results (rows, ES docs). This means runtime fields could work well in this use case: the data from Elasticsearch is prefiltered on the query id, and runtime fields are then used to run queries on the resulting data. If we use runtime fields as part of the mappings, I would expect users to still have exactly the same experience in Elasticsearch or Kibana (for example Lens), but that is something that should be verified.
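(A minimal sketch of that flow using search-time runtime fields, so nothing extra needs to be mapped; the index pattern, the query-id field name, and the osquery column are placeholders rather than the actual schema:)

# Sketch only: index pattern, query-id field, and column name are placeholders.
POST logs-osquery_manager.result-*/_search
{
  "query": { "term": { "action_id": "some-query-id" } },
  "runtime_mappings": {
    "osquery.parent_name": { "type": "keyword" }
  },
  "fields": ["osquery.parent_name"]
}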

@jsoriano @mtojek Are runtime fields today supported in the package spec for mappings?

@andrewkroh
Member

Regarding Netflow, while there are many possible fields, in my experience only a small subset are used based on the vendor sending data. My Ubiquiti router, for example, populates 26 of the netflow.* fields. Aside from a large mapping, do these extra unused fields create larger indices (I thought they did not)? If that's the case we could recommend to users that they use separate data streams for different vendors (via namespaces) to avoid sparsity issues caused by each vendor using a different subset of netflow fields.

How are these netflow.* fields used? Could it be an option to convert the netflow object to the flattened type?

This data is mostly metrics about network flows. There are numbers, IPs, dates, and keywords. Only about 17% are keywords. I don't think flattened would be a good fit for the metrics data. We could possibly lump all of the keyword fields into a single flattened field.

   4 "boolean"
   8 "float"
  12 "double"
  27 "date"
  66 "ip"
 165 "integer"
 197 "short"
 231 "keyword"
 612 "long"

Do we need to index all these fields?

Probably not, but it's hard to say what metrics a user will rely on heavily in order to choose the fields to index.
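(One middle ground, sketched here, is to keep such metrics mapped but disable indexing on them: with "index": false they can still be aggregated through doc values, at the cost of slow or no filtering. Note this does not reduce the mapped-field count, only the indexing work; the template name and field below are just examples:)

# Sketch only: template name and field are illustrative.
PUT _component_template/netflow_unindexed_metric_sketch
{
  "template": {
    "mappings": {
      "properties": {
        "netflow": {
          "properties": {
            "octet_delta_count": { "type": "long", "index": false }
          }
        }
      }
    }
  }
}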

@jsoriano
Member Author

@jsoriano @mtojek Are runtime fields today supported in the package spec for mappings?

They aren't supported yet; we have this issue, elastic/package-spec#39, and we could prioritize it if we find that it could solve this kind of issue.

It'd be good to know, though, whether having many unused runtime fields is better than having many unused mapped fields.

If that's the case we could recommend to users that they use separate data streams for different vendors (via namespaces) to avoid sparsity issues caused by each vendor using a different subset of netflow fields.

This could also be a good strategy if having many mappings is not a problem in itself.

@jpountz

jpountz commented Mar 18, 2022

@jpountz Is there a limit / recommendation on the max number of runtime fields someone should use for a single data stream / mapping?

Runtime fields help by not impacting storage with high numbers of fields, but they still put overhead on things like the cluster state. My gut feeling is that they're not the right answer to this problem. Based on the data points on this issue, it looks like some of the fields never get populated. I'd be interested in seeing whether we could avoid mapping these fields in the first place.

Regarding Netflow, while there are many possible fields, in my experience only a small subset are used based on the vendor sending data. My Ubiquiti router, for example, populates 26 of the netflow.* fields. Aside from a large mapping, do these extra unused fields create larger indices (I thought they did not)?

Unused fields do not make indices larger; Lucene never learns about fields that exist in mappings but not in documents, only Elasticsearch knows about these fields.

My first intuition is that handling data where the set of fields actually used in practice is not known in advance is a good fit for either flattened, when the set of fields is unbounded, or dynamic mappings, when we know that there can only be so many different fields. In the case of the Netflow integration and its netflow.* fields, could we rely on dynamic mappings (possibly with a few dynamic mapping rules for specific types like IP addresses)?
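(A rough sketch of that approach: "dynamic": true on the netflow object plus a couple of dynamic_templates rules. The template name and match patterns are illustrative only and would need vetting against the real field list, as the follow-up comments discuss:)

# Sketch only: patterns are illustrative, not a vetted rule set.
PUT _component_template/netflow_dynamic_sketch
{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "netflow_ipv4_addresses": {
            "path_match": "netflow.*_ipv4_address",
            "mapping": { "type": "ip" }
          }
        },
        {
          "netflow_numbers": {
            "path_match": "netflow.*",
            "match_mapping_type": "long",
            "mapping": { "type": "long" }
          }
        }
      ],
      "properties": {
        "netflow": { "type": "object", "dynamic": true }
      }
    }
  }
}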

@andrewkroh
Member

In the case of the Netflow integration and its netflow.* fields, could we rely on dynamic mappings (possibly with a few dynamic mapping rules for specific types like IP addresses)?

That sounds like a good approach.

I would keep the mappings in place for the float/double/date/ip fields and rely on dynamic mappings for all the other netflow.* fields. The reason for keeping float/double is to prevent getting the mapping wrong in case the first value happens to be a 0 or some non-floating-point value. The reason for date is that date_detection is turned off by Fleet. And ip is because I don't see a reliable path_match rule to apply. If we were willing to cause a breaking change, we could rename some fields to establish a naming convention that makes it trivial to apply path_match rules. (That's something to keep in mind for new integration development.)

  8 "float"
  12 "double"
  27 "date"
+ 66 "ip"
-----------
  113 netflow fields
+ 425 non-netflow fields
-----------
  538 total fields

@ruflin
Member

ruflin commented Mar 21, 2022

On the IP side, it would be nice if ES had a feature to "detect" IP addresses based on the pattern of these fields.

@andrewkroh For the naming convention, are you thinking of something like .ip or _ip for matching? You mention 425 non-netflow fields, what are these?

@andrewkroh
Member

@andrewkroh For the naming convention, are you thinking of something like .ip or _ip for matching?

Yes, I was thinking of a suffix based on data type like _ip. But I'm not planning any changes now because it would be a breaking change.
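(If that hypothetical convention existed, the dynamic template entry would be trivial; shown only as a sketch, since no such rename has happened:)

# Hypothetical: assumes fields renamed to end in "_ip"; no such convention exists today.
{
  "dynamic_templates": [
    {
      "ip_by_suffix": {
        "match": "*_ip",
        "mapping": { "type": "ip" }
      }
    }
  ]
}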

You mention 425 non-netflow fields, what are these?

These are the fields I mentioned in #2839 (comment). They are mostly ECS fields and a few Filebeat fields. And as I mentioned there, I suspect that many of those are unused and that the list of ECS fields could be drastically pruned if we do a thorough analysis.

@jpountz

jpountz commented Mar 21, 2022

On the IP side, it would be nice if ES had a feature to "detect" IP addresses based on the pattern of these fields.

We could do something like that. We'd just need to be careful because IP addresses can be very short strings like ::1, so we should make sure that none of the non-IP fields could ever take a value that looks like an IP address.

Out of curiosity, is the type known on the agent side? If so, could the agent send dynamic mappings as part of the bulk request to help Elasticsearch make the right decision?

Separately, I've been looking at the fields of the Netflow integration, and it looks like we're always using different field names for IPv4 and IPv6 addresses, e.g. netflow.post_nat_source_ipv4_address and netflow.post_nat_source_ipv6_address. It makes it hard to e.g. compute top IP addresses across both fields. I guess we're doing this to reflect fields that are populated by the Netflow integration, but it likely makes it harder to analyze the data compared to if both IPv4 and IPv6 were stored in the same field.

@andrewkroh
Member

it looks like we're always using different field names for IPv4 and IPv6 addresses, e.g. netflow.post_nat_source_ipv4_address and netflow.post_nat_source_ipv6_address. . . I guess we're doing this to reflect fields that are populated by the Netflow integration

Rather than modify the original netflow data, the Filebeat input passes through the fields with minimal changes. This includes keeping the original field names (with a change to snake case).

So post_nat_source_ipv4_address maps to IPFIX postNATSourceIPv4Address. If you want a normalized field to aggregate on then you could use ECS source.nat.ip (I think that's what both post_nat_source_ipv4_address and post_nat_source_ipv6_address map to).

@ruflin
Member

ruflin commented Mar 23, 2022

For the ip fields, could we match on *IPv4*, *IPv6* or similar? Does ES support something like this?

@ruflin
Member

ruflin commented Mar 23, 2022

Linking to elastic/kibana#128152 here as "too many fields" in a data stream also has effects on query.default_field.

@jpountz

jpountz commented Mar 23, 2022

Elasticsearch does support matching field names using wildcards, but it looks like it wouldn't work since some fields have ipv4 in their names but they are not IP addresses, e.g. netflow.destination_ipv4_prefix_length (short).

@pzl
Member

pzl commented Apr 6, 2022

Wanted to add some discussion here about what to do in the endpoint package. Most of our data stream counts are under control:

Data Stream Field count
action_responses 35
actions 37
alerts 1456
collection 20
file 171
library 161
metadata 54
metrics 107
network 153
policy 104
process 345
registry 114
security 107

With alerts being the outsized problem here.

By namespace:

Count Namespace
1 ecs
1 Events
1 message
1 @timestamp
2 dns
2 elastic
3 data_stream
3 registry
5 agent
6 Memory_protection
7 group
10 rule
11 Responses
12 destination
12 source
16 event
17 Endpoint
17 user
30 Ransomware
38 dll
47 host
92 file
309 Target
389 process
424 threat

I'm not sure what a good reduction strategy would be here.

@mtojek
Contributor

mtojek commented Apr 6, 2022

@pzl Regarding the "alerts" datastream, do you use runtime fields or standard ones?

@pzl
Member

pzl commented Apr 6, 2022

Only standard I believe. The fields.yml is here

@mtojek
Contributor

mtojek commented Apr 6, 2022

I admit that I don't know the domain and you're the experts here, but it's hard to believe that we need so many fields. What are the typical queries for this data? Similar question as here: #2839 (comment)

@ruflin
Member

ruflin commented Apr 7, 2022

@pzl Who creates the alerts data? Is this shipped by the endpoint binary?

@pzl
Member

pzl commented Apr 7, 2022

Yes, this is for data sent by the endpoint binary. I am reaching out to find a person who can speak to the particular uses of the alerts index.

@kevinlog

@ruflin @mtojek I met with some stakeholders on the Security side regarding our usage of the alerts datastream.

Currently, this index sits at 1456 mapped fields. The alerts datastream is the one that the Endpoint Security binary streams all alert documents to; they represent potential threats on our users' remote machines. For instance, if a malicious file attempts to run on a host, the Endpoint will detect it and then send an Alert document to ES to notify the Security user/analyst that a machine on their network was attacked. Alerts are the most important type of data that the Security Endpoint streams to ES and are at the heart of most use cases in an Endpoint Detection and Response product.

There are about 5 different alert types that can be streamed to this datastream. Many mapped fields overlap; however, there are also several sets of mapped fields that would only be streamed for a particular alert type. In addition, depending on the details of a potential attack, even alerts of the same type could stream different fields. This is the reason for so many mapped fields. It is also important to note that a single Alert document will certainly not contain a majority of these fields at once, but the variability of Alerts is the driver for the number of fields. @magermark could potentially give a rough estimate of the number of fields Alert docs have on average, if that's helpful.

We're apprehensive about pruning too many fields because we don't want to limit users in how they can search, build rules, and visualize their alerts data. There are certainly fields we could remove from mappings, but we're likely to add additional fields in the future, so while we could potentially reduce the number to ~1024, it would be a fairly temporary fix as we add more features to Alerts.

Some options we discussed:

  • Manually prune fields that we don't think need to be mapped
    • As stated above, we could probably get the number down low enough, but it could grow again as we add new features to Alerts
  • Add an exception field to the package-spec to allow a datastream to have more than 1024 mapped fields in a package.
    • This is an Elastic maintained package, maybe it's OK to allow exceptions for those, but not external packages?
    • I see this is mentioned in the issue description, Alerts may be a good candidate for this.
  • Split the Alerts datastream into new datastreams representing individual alert types.
    • This could spread the mappings across different datastreams
    • This causes backwards compatibility issues since we need to support older Endpoint binaries that stream only to the existing Alerts datastream. Because of this, I'm not sure we can eliminate the original Alerts datastream initially anyway, so we'd still have one with many mapped fields.
    • In any case, I wouldn't consider this a short term fix. It could be a few release cycles before we had bandwidth to do this

Happy to explore other options. Perhaps we could make more use of dynamic mappings or flattened objects, but I would need to understand more about them before deciding which fields to replace.

cc @magermark @joe-desimone @pzl @ferullo

@ruflin
Member

ruflin commented Apr 14, 2022

Does the endpoint binary know the type of each field? Or does endpoint rely on Elasticsearch to define the mapping? I'm asking because one option could be that the field type is shipped as part of the event instead of having the mapping set in advance. But this would likely cause problems with "too many permissions".

One of the key points above is that the number of fields is fixed and known in advance. It sounds like the number could still increase, but it is not like it will double in just a few weeks. Having that many fields is not necessarily a problem, but it often indicates a smell. But for the alerts scenario, I'm ok if we increase the limit.

My general concern is that if we introduce this flag, other data streams will just enable it without having the detailed discussion we had here.

Is using runtime fields instead of mapped fields an option? What is the maximum number of alerts that would exist in such a data stream and need to be searched?

@botelastic

botelastic bot commented Apr 14, 2023

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Apr 14, 2023
@botelastic botelastic bot closed this as completed Oct 11, 2023
@zez3

zez3 commented Oct 11, 2023

:(
