Provide a bulk API for creating ingest assets #77505

Closed · joshdover opened this issue Sep 9, 2021 · 21 comments

@joshdover (Contributor) commented Sep 9, 2021

When Fleet installs Elasticsearch ingest assets (index and component templates, ingest pipelines, ILM policies, etc.) for a package, we're currently bottlenecked by queueing behavior on cluster state updates as observed in this issue: elastic/kibana#110500 (comment)

This is causing some package installs to take upwards of 30s. This is a problem for Fleet, Kibana, and Elastic Agent for two primary reasons:

  1. We need the ability to upgrade packages on Kibana upgrades to keep some ingest assets in sync with the rest of the Stack (e.g. assets used by APM Server or by Elastic Agents themselves for monitoring).
  2. We also will likely want the ability to automatically downgrade packages and reinstall older versions of assets when an issue with a Kibana upgrade requires a rollback to the previous Kibana version. This would require re-writing all ingest assets in Elasticsearch to be sure they're compatible with the older Kibana version.

For both of these use cases, if this process is slow, Kibana upgrades and rollbacks will be too slow and possibly time out depending on the configuration of the orchestration layer.

When executing Fleet's setup process which installs the system package, we're seeing cluster state updates take ~150ms each on a single node cluster running on the same machine as Kibana. See the node stats results taken here before and after the setup process: node_stats.zip, es_logs.zip

@DaveCTurner mentioned that one way we could optimize this is by providing a bulk API to batch these cluster state updates in a single write.
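
For context, each ingest asset is installed today with its own API call, and each call triggers its own cluster state update. A minimal sketch of the kind of requests involved (asset names and bodies are illustrative placeholders, not taken from a real package):

ES=http://localhost:9200
AUTH="elastic:changeme"
# Each PUT below is a separate cluster state update; a package install issues hundreds of these.
curl -u "$AUTH" -XPUT "$ES/_ilm/policy/logs-example" -H 'content-type: application/json' \
  -d '{"policy":{"phases":{"hot":{"actions":{"rollover":{"max_primary_shard_size":"50gb"}}}}}}'
curl -u "$AUTH" -XPUT "$ES/_ingest/pipeline/logs-example" -H 'content-type: application/json' \
  -d '{"processors":[{"set":{"field":"event.ingested","value":"{{_ingest.timestamp}}"}}]}'
curl -u "$AUTH" -XPUT "$ES/_component_template/logs-example@settings" -H 'content-type: application/json' \
  -d '{"template":{"settings":{"index.number_of_shards":1}}}'
curl -u "$AUTH" -XPUT "$ES/_index_template/logs-example" -H 'content-type: application/json' \
  -d '{"index_patterns":["logs-example-*"],"priority":200,"data_stream":{},"composed_of":["logs-example@settings"]}'

The bulk API proposed here would let a batch of such assets be submitted in one request and applied in far fewer cluster state updates.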

@joshdover added the >enhancement, :Distributed/Distributed, and needs:triage labels on Sep 9, 2021
@elasticmachine added the Team:Distributed label on Sep 9, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor)

There's a question about why cluster state updates take ~150ms in these tests, and that falls under the :Distributed/Cluster coordination label, but the question about a bulk API for installing templates/pipelines/ILM policies etc. is the domain of the data management team, so I'm moving this over there.

@elasticmachine added the Team:Data Management label on Sep 9, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@DaveCTurner removed the :Distributed/Distributed, Team:Data Management, Team:Distributed, and needs:triage labels on Sep 9, 2021
@dakrone (Member) commented Sep 9, 2021

@joshdover out of curiosity, do you have an idea of the number of items Fleet usually would want to do in a single request? 10s? 100s? 1000s?

@dakrone (Member) commented Sep 9, 2021

Also, rather than batching cluster state updates, perhaps it would be better to teach Elasticsearch the concept of a package, with a set of templates, policies, pipelines, and metadata, and then make adding or removing a package an atomic operation from a cluster-state perspective.

@jakelandis (Contributor)

perhaps it would be better to teach Elasticsearch the concept of a package

That is an interesting idea, and it is tangentially related to #63798. However, that request could be interpreted as a single package contributing to something shared among all installed packages. I am not sure that request is actually related, but if we pursue a package-based approach we should probably consider things a package may want to install that are not unique to that package (for example, if multiple packages re-use ingest pipelines, we would need a way to keep them from stomping on each other).

@dakrone (Member) commented Sep 9, 2021

we should probably consider things a package may want to install that are not unique to that package (for example, if multiple packages re-use ingest pipelines, we would need a way to keep them from stomping on each other)

I think we'd probably need both: we will want the concept of "global" settings/items, and we will also want two packages to be able to initially share a "thing" but have a later change to that thing apply to only one package (i.e. namespacing of customization).

@joshdover (Contributor, Author)

out of curiosity, do you have an idea of the number of items Fleet usually would want to do in a single request? 10s? 100s? 1000s?

For initial setup, we're looking at 100s of objects but in the future as more packages are being upgraded we could potentially want to upgrade 1000s of objects at once. That said, we'll likely want to do these in one request per package, so that in case any of these operations fail we can isolate the failure to a single ingest integration.

@joshdover (Contributor, Author)

perhaps it would be better to teach Elasticsearch the concept of a package

This is the long-term plan and I agree it's something we'll need at some point. My thinking was that it may be simpler to start with a bulk API that could later be used under the hood by a more complete package API abstraction. That way we can more immediately solve the problems we're seeing now while we work out the many details of what a package API would need to support. I'll defer to you folks on what makes sense here.

@joshdover (Contributor, Author) commented Jan 6, 2022

Package install and upgrade performance continues to be a challenge for both Kibana reliability and user onboarding. I want to highlight the recent changes in these areas and how this challenge affects the user and operator experience.

User onboarding

One user experience change targeted to ship in 8.1 is the removal of default package installation (elastic/kibana#108456). This will move the package installation step for key required packages (fleet_server, elastic_agent, system) into the onboarding process, when a user sets up their first Agent.

By moving this installation step to the onboarding flow, we're adding a significant delay of 30s or more to a key step that must happen before the user is instructed to actually install their first agent. We have concerns that this delay may have a negative impact on the success rate of users getting started with the Stack.

This is primarily or entirely bottlenecked by the performance of creating ingest assets in Elasticsearch.

Kibana reliability

As of elastic/kibana#111858, which is shipping in 8.0, Kibana will install and upgrade 1st-party packages (Endpoint, APM, Synthetics, etc.) on boot. In this initial version, the process does not block Kibana startup and instead runs as an asynchronous upgrade. This is not ideal, as it may give operators a false sense that the Kibana upgrade has completed and that it's safe to start upgrading Fleet Server, Elastic Agent, or standalone APM Server. If these components are upgraded before packages have finished upgrading, ingest could break, resulting in dropped data, or data could be ingested in a format that is unusable by application UIs or dashboards.

One of the reasons we are hesitant to block Kibana startup (elastic/kibana#120616) is the slow installation process which is primarily bottlenecked by this issue.

It's worth noting that package upgrades are currently implemented as full removal and then subsequent installation. This means that we'd need to be able to both delete and create ingest assets quickly in Elasticsearch. If increasing the scope of this bulk create API to also support deletes is a major challenge but supporting updates is not, it's possible we could revisit the upgrade logic in Fleet to minimize the deletes we do and leverage the bulk create/update logic.

cc @jakelandis @dakrone

@joshdover (Contributor, Author) commented Apr 7, 2022

@jakelandis You asked for some additional metrics and reproduction steps here. Could you clarify what would be helpful to provide aside from what I provided here: elastic/kibana#110500 (comment)?

This can easily be reproduced by:

  1. Configure Kibana to send APM data to a cluster of your choosing, by setting these env vars:
     ELASTIC_APM_ACTIVE=true
     ELASTIC_APM_SERVER_URL=https://myapmendpoint.com/
     ELASTIC_APM_SECRET_TOKEN=foo
  2. Start ES and Kibana
  3. Run this API call against Kibana:
     curl -XPOST -H 'content-type: application/json' -H 'kbn-xsrf: foo' -u elastic:changeme http://localhost:5601/api/fleet/epm/packages/system/1.6.4
  4. View the trace in APM on the /api/fleet/epm/packages endpoint

If this isn't enough to go on, I think this could easily be emulated by trying to do many PUT calls on index templates and ingest pipelines in parallel using concurrently in the shell.
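
A rough sketch of what that shell emulation could look like, assuming a local single-node cluster with the elastic:changeme credentials used above (the count, asset names, and bodies are made up for illustration):

ES=http://localhost:9200
# Fire many component template and ingest pipeline PUTs concurrently, then wait for all of them.
for i in $(seq 1 40); do
  curl -s -o /dev/null -u elastic:changeme -XPUT "$ES/_component_template/repro-ct-$i" \
    -H 'content-type: application/json' \
    -d '{"template":{"settings":{"index.number_of_shards":1}}}' &
  curl -s -o /dev/null -u elastic:changeme -XPUT "$ES/_ingest/pipeline/repro-pipeline-$i" \
    -H 'content-type: application/json' \
    -d '{"processors":[{"set":{"field":"repro","value":true}}]}' &
done
time wait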

@jakelandis (Contributor)

I think this could easily be emulated by trying to do many PUT calls on index templates and ingest pipelines in parallel using concurrently in the shell.

Yes, that would help us isolate the issue. Do you have any example index templates and ingest pipelines we can use to test? How much concurrency do you have? i.e. a dozen concurrent requests for a mix of templates and pipelines, or just 2 concurrent requests with separate lanes for pipelines and templates? What is the cluster setup? (A single node hosted locally?)

Any information you can provide that allows us to reproduce this without Fleet, but modeled closely on Fleet's usage, would greatly help us identify the slowdown.

@dakrone (Member) commented Apr 8, 2022

I can think of at least one (hopefully quick) thing that may help this without any additional API overhead—we could change the cluster state updates for these to be batched (currently neither the templates nor ingest pipelines are batched). That would only really help if multiple things of the same type were being installed in parallel, however. Judging by the issue Josh linked where they were experimenting with both, I think it could help the parallel case.

(although none of this is backed up by numbers, and we'd want to have a steadily reproducible way to test this, as Jake mentioned above)

@joshdover (Contributor, Author) commented Apr 11, 2022

I'll create an easy repro example later this week. In the meantime, I can discuss how the parallelism works.

  • Multiple packages can be installed at once, in parallel. This is Node.js, so there's no limit on the number of async requests we execute at once. I've also confirmed we're not hitting any connection cap in the client, but that was months ago and it should be revalidated.
  • For each package, each asset type is currently installed in serial, in this order:
    • Kibana Saved Objects
    • ILM policies
    • ML models
    • Ingest pipelines (new ones created)
    • Index and component templates
    • Data streams are rolled over
    • Transforms
    • Ingest pipelines (old ones deleted)
  • Within each asset type, each individual asset is done in parallel, except in cases where there's a dependency (component templates are created before the index template that references them).

I've experimented with creating each asset type in parallel when possible, and this did not improve performance at all; it just resulted in more async requests waiting at once.

The bulk of the code for this lives here: https://github.com/elastic/kibana/blob/66b3f01a17dbcbb35fdf47ea439b8dd8666ae249/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts#L117
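
To mirror that ordering outside of Fleet, an emulation would run each asset type as a parallel batch and wait for it to finish before starting the next. A minimal bash sketch under those assumptions (asset names and bodies are hypothetical):

ES=http://localhost:9200; AUTH="elastic:changeme"

put_pipelines() {
  # all ingest pipelines for the package, created in parallel
  for p in pkg-pipeline-a pkg-pipeline-b; do
    curl -s -o /dev/null -u "$AUTH" -XPUT "$ES/_ingest/pipeline/$p" \
      -H 'content-type: application/json' \
      -d '{"processors":[{"set":{"field":"installed_by","value":"repro"}}]}' &
  done
  wait
}

put_templates() {
  # component templates in parallel, then the index template that references them
  for t in pkg-ct-a pkg-ct-b; do
    curl -s -o /dev/null -u "$AUTH" -XPUT "$ES/_component_template/$t" \
      -H 'content-type: application/json' \
      -d '{"template":{"settings":{"index.number_of_shards":1}}}' &
  done
  wait
  curl -s -o /dev/null -u "$AUTH" -XPUT "$ES/_index_template/pkg-template" \
    -H 'content-type: application/json' \
    -d '{"index_patterns":["pkg-*"],"composed_of":["pkg-ct-a","pkg-ct-b"]}'
}

put_pipelines   # ingest pipelines first, matching the order listed above
put_templates   # then component and index templates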

@joshdover (Contributor, Author) commented Apr 14, 2022

I've created a script that emulates the parallelism we use during package installation when installing the system package. This package only contains Saved Objects (not included in my script), ingest pipelines, component templates, and index templates. You'll see that the ingest part of the package installation (everything except the SOs) takes ~6.5s when run raw like this. With @jakelandis's ES build that includes timing logs on these endpoints, I noticed component templates taking >2.5 seconds to be created.

To use my script:

  1. Start ES with changeme as the elastic password
  2. Unzip the archive
  3. Run ./run.sh, observe timings
  4. Run ./teardown.sh - note this doesn't emulate our parallelism during uninstalls, it's just a convenience script for re-running the test without having to start a clean ES

package-install-repro.zip

Branch with the (hacky) code I used to generate this script while running a package install: https://github.com/joshdover/kibana/tree/fleet/install-repro.

Here's a related APM trace from the real package install code in Kibana that shows similar behavior:
[screenshot: APM trace of the package install]

@dakrone (Member) commented Apr 19, 2022

Thanks @joshdover, this was useful! I ran some local tests with your code. On my local laptop I was at ~6s for installing all the operations, and ~4s for installing them all on my desktop machine. Interestingly, even with a single ES node, you can see the nanosecond timings for PUT-ing a component template keep increasing as more and more are done in parallel:

component template time to install NON-batching (in nanoseconds):
193570000
103017000
97592875
179529792
292973166
392919291
469226625
551573167
683457458
761124667
840760875
980525000
1057703666
1137113667
1316617209
1401967667
1513343958
1714177875
1887339000
1976050292
2059373333
2185087958
2303293375
2486336042
2571670625
2652833625
2773985833
2856030209
2936203417
3076151084
3161950541
3243575167
3382663917
3484927917
3569993167
3644332375

I did a quick-and-dirty batching implementation for all the template stuff as well as ingest pipelines. That brought the real time for the reproduction script to ~1.2s on my desktop (so ~4s => ~1.2s), which seems like a pretty good improvement considering that there isn't even any network overhead for the cluster state updates. The nanosecond timings for the component templates even out over time also, if you compare the timings:

component template time to install WITH-batching (in nanoseconds):
37369903
39976718
69583654
70282912
300642897
300601008
300733378
301792195
302690829
302738018
303127592
301654274
301658623
303753222
300833978
299771164
299143169
301040587
297771062
297805007
298434795
296777609
300011938
296849525
296877517
296880223
297054001
297145763
299669202
301940534
297097322
301204276
297407938
297501865
303229554
297697925

You can see how the timing reaches a steady equilibrium of ~0.3 seconds to install each component template.

I've opened up a WIP PR with my changes at #86017, and here are some custom 8.3.0-SNAPSHOT builds that include the changes from that PR as well as some timing output from Jake:

Could you try your tests with these builds and see if this is enough for the short term to alleviate this problem? (Feel free to generate your own ES build from my PR, I wasn't sure whether that was something you wanted to do, hence the custom builds; I'd try the Kibana reproduction but I have never been able to figure out how to build Kibana locally). If this seems promising to you I can work on getting that PR polished up and merged in.

@jen-huang

@dakrone Awesome improvements. QQ, I see your second snippet has the note install WITH-batching - does this refer to batching on the ES side, or are you recommending for Fleet to batch our requests to ES?

@dakrone (Member) commented Apr 20, 2022

does this refer to batching on the ES side, or are you recommending for Fleet to batch our requests to ES?

All of these timings use the reproduction script, which sends requests in parallel to Elasticsearch. The batching I mentioned on the ES side is batching cluster state update tasks (which occur when the pipeline and templates are created).

@joshdover (Contributor, Author)

Thanks @dakrone. With your changes alone I see an improvement when installing the system package (same package from the repro case I used) from 22.9s to 12.0s, a drop of 48% of the total package installation time.

With a few additional improvements on the Fleet side (elastic/kibana#130906) I was able to optimize this further, down to 8.0s on my local machine, totaling an improvement of 65% which is nearly 3x as fast. I believe there is likely another win to make on the Fleet side to shave off another 1-2s.

I think we should definitely move forward on your PR. 🎉

@dakrone (Member) commented Apr 25, 2022

Cool, thanks @joshdover, I'll work on getting the PR in.

dakrone added a commit that referenced this issue May 4, 2022
This commit changes the cluster state operations for templates (legacy, component, and composable) as well as ingest pipelines to be bulk executed. This means that they can be processed much faster when creating/updating many simultaneously.

Relates to #77505
@joshdover (Contributor, Author)

With the batching changes in and the improvements I’ve been able to make on the Kibana side, I think we can close this issue for now. I am still seeing some related slowness around creating ES transforms, but that is a separate problem that can be evaluated independently.

Thanks all, @dakrone and @jakelandis

@joshdover closed this as not planned on May 4, 2022