Discuss strategies to reduce number of fields #2839
Let's ping the folks behind these packages. Netflow: @andrewkroh @marc-gr
It would be helpful to break down the contents of the netflow integration's fields, so I've included a table for reference (I also looked at all the data streams: https://gist.github.com/andrewkroh/885e28b1cdafbeacf0fca20b062a6de2). There are 1322 fields. Looking at this list, I suspect that a small number of the included ECS fields are unused. I'm open to ideas, but I don't see a good way to drastically reduce the field count for netflow.
@andrewkroh thanks for the analysis. I agree that it can make sense to have so many fields in some cases, but I wonder if we could do something to avoid indexing them. How are these fields used? In any case, if we think that a limit of 2048 fields is not so bad, we can probably go on with this. I would be cautious about adding ways to circumvent this limit, because it could produce packages with way too many fields once we open development to more teams.
Do we need to index all these fields? My assumption is that the majority of them will not show up in most events. Would it be possible to index only a small portion and, for the rest, either use runtime fields in the mappings or at query time? Which of these fields do we use for the dashboards? https://github.com/elastic/package-storage/tree/production/packages/netflow/1.4.0/kibana/dashboard

Having the majority as runtime fields would still keep the template itself large, but I assume it would be more efficient storage-wise.

@jpountz Is there a limit / recommendation on the maximum number of runtime fields someone should use for a single data stream / mapping? If we go down that route, we could have two limits: one for indexed fields and one for non-indexed fields.
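As a minimal sketch of that idea (the template and field names here are made up for illustration, not taken from the actual netflow package): a field declared under `runtime` in the mappings is not indexed at all, and with no script attached, Elasticsearch reads its value straight out of `_source` at query time.

```json
PUT _index_template/netflow-runtime-sketch
{
  "index_patterns": ["logs-netflow.log-sketch-*"],
  "template": {
    "mappings": {
      "runtime": {
        "netflow.post_mcast_packet_delta_count": { "type": "long" }
      }
    }
  }
}
```

Such a field can still be queried, aggregated, and sorted on; it just pays the cost at search time instead of at index time.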
It's not clear how we can trim the number of mapped fields for osquery.
When you mention 279 tables, it sounds like some grouping already exists. What kind of queries are run? Is there normally one query per table? If so, should we have one data stream per table (that seems extreme too)? What about data retention, is it the same across all fields / data structures? On ingestion, does one event always go into a single "table", meaning there are 279 different event types? And what about runtime fields, are they an option?
We didn't put any restrictions; it can be any kind of query, including subqueries and joins of any complexity. What would the runtime-fields experience be in Kibana? Can users filter/sort/dissect the data received from live queries right away with the existing Kibana UX?
I had a conversation with @aleksmaus around how exactly the results data is queried in Elasticsearch itself. Results are grouped by query id, and normally there are a few hundred to at most a few thousand results (rows, ES docs). This means runtime fields could work great in this use case: the data from Elasticsearch is prefiltered on the query id, and runtime fields are then used to run queries on the resulting data.

If we use runtime fields as part of the mappings, I would expect users to still have exactly the same experience in Elasticsearch or Kibana (for example Lens), but that is something that should be verified.

@jsoriano @mtojek Are runtime fields supported in the package spec for mappings today?
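A sketch of that flow (the data stream, `action_id`, and column names are assumptions for illustration, not the actual osquery_manager schema): the search is prefiltered on the query id, and the column of interest is defined as a search-time runtime field that can then be filtered and sorted like a mapped field.

```json
POST logs-osquery_manager.result-default/_search
{
  "query": {
    "term": { "action_id": "my-live-query-id" }
  },
  "runtime_mappings": {
    "osquery.parent_name": { "type": "keyword" }
  },
  "fields": ["osquery.parent_name"],
  "sort": [{ "osquery.parent_name": "asc" }]
}
```

Because the runtime field is only evaluated against the few hundred to few thousand prefiltered docs, the query-time overhead stays small.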
Regarding Netflow: while there are many possible fields, in my experience only a small subset is used, depending on the vendor sending the data. My Ubiquiti router, for example, populates 26 of the fields.
This data is mostly metrics about network flows. There are numbers, IPs, dates, and keywords; only about 17% are keywords. I don't think all of them need to be indexed, do they?
Probably not, but it's hard to say which metrics a user will rely on heavily in order to choose the fields to index.
They aren't supported yet; we have this issue: elastic/package-spec#39. We could prioritize it if we find that it could solve this kind of issue. It'd be good to know, though, whether having many unused runtime fields is better than having many unused mappings.

This could also be a good strategy if having many mappings is not a problem by itself.
Runtime fields help by not impacting storage when there are high numbers of fields, but they still put overhead on things like the cluster state. My gut feeling is that they're not the right answer to this problem. Based on the data points in this issue, it looks like some of the fields never get populated. I'd be interested in seeing whether we could get these fields to never be mapped in the first place.
Unused fields do not make indices larger; Lucene never even learns about fields that exist in mappings but not in documents, only Elasticsearch knows about them. My first intuition is that handling data where the set of fields actually used in practice is not known in advance is a good fit for dynamic mappings.
That sounds like a good approach. I would keep the mappings in place for the float/double/date/ip fields and rely on dynamic mappings for all the other fields.
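A sketch of that combination using dynamic templates (names and patterns are illustrative): explicit mappings stay in place for a handful of typed fields, while any other `netflow.*` field is only added to the mapping when it first appears in a document, with strings defaulting to `keyword`.

```json
PUT _index_template/netflow-dynamic-sketch
{
  "index_patterns": ["logs-netflow.log-sketch-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "netflow_strings_as_keywords": {
            "path_match": "netflow.*",
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ],
      "properties": {
        "netflow": {
          "properties": {
            "octet_delta_count": { "type": "long" },
            "flow_start_milliseconds": { "type": "date" }
          }
        }
      }
    }
  }
}
```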
On the IP side, it would be nice if ES had a feature to "detect" IP addresses based on the pattern of these fields. @andrewkroh For the naming convention, are you thinking of something like a type suffix in the field names?
Yes, I was thinking of a suffix based on the data type.
These are the fields I mentioned in #2839 (comment). They are mostly ECS fields and a few Filebeat fields. And as I mentioned there, I suspect that many of them are unused and that the list of ECS fields could be drastically pruned if we do a thorough analysis.
We could do something like that. We'd just need to be careful, since IP addresses can be very short strings.

Out of curiosity, is the type known on the agent side? If so, the agent could send dynamic mappings as part of the bulk request to help Elasticsearch make the right decision.

Separately, I've been looking at the fields of the Netflow integration, and it looks like we're always using different field names for IPv4 and IPv6 addresses.
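The second idea exists in Elasticsearch as the `dynamic_templates` parameter of the bulk API: a shipper that knows the type can point a field at a named dynamic template per document. A sketch with made-up index, template, and field names:

```json
PUT netflow-bulk-sketch
{
  "mappings": {
    "dynamic_templates": [
      { "ip_fields": { "mapping": { "type": "ip" } } }
    ]
  }
}

POST _bulk
{ "index": { "_index": "netflow-bulk-sketch", "dynamic_templates": { "destination_address": "ip_fields" } } }
{ "destination_address": "192.0.2.10" }
```

Note that `ip_fields` defines no match conditions at all, so it never fires on its own; it is only applied when a bulk request names it explicitly.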
Rather than modify the original netflow data, the Filebeat input passes the fields through with minimal changes. This includes keeping the original field names (with a conversion to snake case).
For the IP fields, could we match on the field names?
Linking to elastic/kibana#128152 here, as "too many fields" in a data stream also has effects on Kibana.
Elasticsearch does support matching field names using wildcards, but it looks like that wouldn't work here, since some of the field names don't follow a consistent pattern.
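For reference, this is the `match` option of dynamic templates; a sketch assuming suffix patterns like the ones discussed above (though, as noted, it wouldn't catch fields whose names don't follow the pattern):

```json
PUT _index_template/netflow-ip-sketch
{
  "index_patterns": ["logs-netflow.log-sketch-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "ipv4_as_ip": {
            "match": "*_ipv4_address",
            "mapping": { "type": "ip" }
          }
        },
        {
          "ipv6_as_ip": {
            "match": "*_ipv6_address",
            "mapping": { "type": "ip" }
          }
        }
      ]
    }
  }
}
```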
Wanted to add some discussion here about what to do in the endpoint package.

By namespace, the field counts break down as follows: (table not preserved)
I'm not sure what a good reduction strategy would be here.
@pzl Regarding the "alerts" data stream, do you use runtime fields or standard ones?
Only standard, I believe. The fields.yml is here.
I admit that I don't know the domain and you're the experts there, but it's hard to believe that we need so many fields. What are the typical queries for this data? Similar question as here: #2839 (comment)
@pzl Who creates the alerts data? Is it shipped by the endpoint binary?
Yes, this is for data sent by the endpoint binary. I am reaching out to find a person who can speak to the particular uses of the alerts data.
@ruflin @mtojek I met with some stakeholders on the Security side regarding our usage of the alerts data stream.

Currently, this index sits above the 1024-field limit. There are about 5 different alert types that can be streamed to this data stream. Many mapped fields overlap, but there are also several sets of mapped fields that would only be streamed for a particular alert type. In addition, depending on the details of a potential attack, even alerts of the same type can stream different fields. This is the reason for so many mapped fields. It is also important to note that a single Alert document will certainly not contain a majority of these fields at once; the variability of Alerts is the driver for the number of fields. @magermark could potentially give a rough estimate of the number of fields Alert docs have on average, if that's helpful.

We're apprehensive to prune too many fields because we don't want to limit users in how they can search, build rules, and visualize their alerts data. There are certainly fields we could remove from the mappings, but we're likely to add additional fields in the future, so while we could potentially reduce the number to ~1024, it would be a fairly temporary fix as we add more features to Alerts.

Some options we discussed: (list not preserved)
Happy to explore other options. Perhaps we could make more use of dynamic mappings or flattened objects, but I would need to understand more about them before deciding which fields to replace.
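For what it's worth, here is what the `flattened` option could look like; `alert_details` is a hypothetical stand-in for a group of alert-type-specific fields. The whole object counts as a single entry in the mappings and its leaf values are indexed as keywords, which caps the field count at the cost of per-field types (no numeric ranges, no IP matching, etc.):

```json
PUT _index_template/endpoint-alerts-sketch
{
  "index_patterns": ["logs-endpoint.alerts-sketch-*"],
  "template": {
    "mappings": {
      "properties": {
        "alert_details": { "type": "flattened" }
      }
    }
  }
}
```

Exact matches still work on any key inside the object, e.g. a `term` query on `alert_details.rule_name`, so basic searching and rule-building would keep working for keyword-like values.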
Does the endpoint binary know the type of each field, or does endpoint rely on Elasticsearch to define the mapping? I'm asking because one option could be that the field type is shipped as part of the event instead of having the mapping set in advance. But this would likely cause problems around "too many permissions".

One of the key points above is that the number of fields is fixed and known in advance. It sounds like the number could still increase, but it is not as if it will double within a few weeks. Having that many fields is not necessarily a problem, but it often indicates a smell. For the alerts scenario, though, I'm ok if we increase the limit. My general concern is that if we introduce this flag, other data streams will just enable it without having the detailed discussion we had here.

Is using runtime fields instead of mapped fields an option? What is the maximum number of alerts that could exist in such a data stream and would need to be crawled?
Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as stale.
:(
There are some packages in this repository with too many fields in their data streams. Having too many fields can lead to performance degradation and is usually discouraged (see the related ES docs).
In elastic/package-spec#278 we introduced a limit of 1024 fields per data stream, but we had to increase it to 2048 in elastic/package-spec#294 to keep builds green for some packages. We would like to be able to set a lower limit, at least by default.
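For context, this package-spec limit is separate from Elasticsearch's own `index.mapping.total_fields.limit` setting, which defaults to 1000 and which indices backing these large data streams already need to override. For example (the index name is illustrative):

```json
PUT logs-netflow.log-sketch/_settings
{
  "index.mapping.total_fields.limit": 2048
}
```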
Some questions to discuss:
The packages with more than 1024 fields are:
cc @ruflin @mtojek