
Updating resources on a stack upgrade should be easier #103841

Closed
dgieselaar opened this issue Jun 30, 2021 · 8 comments
Labels
discuss Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@dgieselaar
Member

As part of the RAC project, we are installing various component and index templates, and creating indices/aliases that use these templates. When we roll out a new version of the stack, some of these templates might have changed, and we need to update the mappings of write indices, and rollover/migrate data when needed.

Currently, our only option is to use the setup or start lifecycle. However, these are executed on every Kibana instance, so any upgrade strategy needs to take into account that several Kibana instances might want to upgrade assets at the same time. We can use a task, but we also need to know when an asset upgrade has been finished, as we need to block write operations until the upgrade has been completed (this might be possible with a task, not sure).

I'd like us to investigate whether we can make this easier, e.g. by providing a hook that is guaranteed to be executed on exactly one Kibana instance, plus an afterUpgrade hook that is called on each Kibana instance, or something that lets us hook into the upgrade process that happens before Kibana starts, in the same vein as the SO migration process.
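To make the ask concrete, here is a rough sketch of what such a contract could look like. This is purely hypothetical: neither `registerUpgradeHook` nor `registerAfterUpgradeHook` exists in Core today; the names and shape are invented for illustration.

```ts
// Hypothetical sketch only; no such Core API exists today.
// It illustrates the two hooks requested above.
interface UpgradeServiceSetup {
  /**
   * Work that Core would guarantee to run on exactly one Kibana
   * instance during a stack upgrade, before plugins start serving.
   */
  registerUpgradeHook(hook: () => Promise<void>): void;

  /**
   * Called on every instance once the upgrade work has completed,
   * e.g. to unblock write operations that were held back.
   */
  registerAfterUpgradeHook(hook: () => Promise<void>): void;
}
```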

@dgieselaar dgieselaar added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Jun 30, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@pgayvallet
Contributor

pgayvallet commented Jul 6, 2021

Currently, our only option is to use the setup or start lifecycle. However, these are executed on every Kibana instance, so any upgrade strategy needs to take into account that several Kibana instances might want to upgrade assets at the same time

FWIW, this will always be the case. Even if we were to expose a specific API to register upgrade hooks or functions, those functions would have to be implemented in a way that takes into account that multiple Kibana nodes can be performing the operation concurrently. There is currently no synchronization between Kibana instances, and no real way to acquire a 'lock' from ES. The SO migration algorithm faces the same problem.

@dgieselaar
Member Author

@pgayvallet why will this always be the case? We don't support rolling upgrades, no? So all instances go down, the first one to upgrade takes care of the upgrade process, and only once it has completed successfully do the other instances come back up (without executing the upgrade process). What am I missing here?

@pgayvallet
Contributor

pgayvallet commented Jul 6, 2021

So all instances go down, first one to upgrade takes care of the upgrade process, and only when successfully completed the other instances come back up

This assumption is unfortunately wrong (that would be way too easy). All Kibana instances are allowed to boot at the same time during a migration (this is a supported scenario), and we don't have any synchronization mechanism between instances, so each instance does have to take into consideration that other instances may be performing an upgrade at the same time.

I tried to find the document where the whole 'idempotent versus lock' discussion occurred a while ago for SO Migv2, to add more context on all the challenges of introducing a lock mechanism, but I couldn't find it. @joshdover @kobelb maybe you have a better memory than I do?

@mshustov
Contributor

mshustov commented Jul 7, 2021

Besides the necessity of introducing a consensus protocol, there is the problem of blocking Kibana startup: the very reason we deprecated async lifecycles.

A plugin-specific async operation shouldn't block, or (in case of an exception) prevent, other Kibana plugins from starting.
This problem becomes even more relevant in light of the effort to make most plugins not-disable-able #89584.

As part of the RAC project, we are installing various component and index templates, and creating indices/aliases that use these templates. When we roll out a new version of the stack, some of these templates might have changed, and we need to update the mappings of write indices, and rollover/migrate data when needed.

@kobelb Can it benefit from the solution you are designing for the automatic upgrade of the Fleet packages?

@joshdover
Contributor

I tried to find the document where the whole 'idempotent versus lock' approach discussion occurred a while ago for SO Migv2 to add more context of all the challenges of introducing a lock mechanism, but I couldn't find it. @joshdover @kobelb maybe you have a better memory than I do?

I think this is the most complete write-up we have: https://github.com/elastic/kibana/blob/master/rfcs/text/0013_saved_object_migrations.md#52-single-node-migrations-coordinated-through-a-leaselock

Essentially, it's impossible to build a bullet-proof lease/lock on top of Elasticsearch as it is. So in order to use a lock, we'd need to either add Kibana node clustering & master election or work with the Elasticsearch team to provide a first-class lock mechanism.

When we roll out a new version of the stack, some of these templates might have changed, and we need to update the mappings of write indices, and rollover/migrate data when needed.

Given the above, I'm curious which of these operations would be problematic to build in an idempotent way that could be run on all Kibana nodes during start at once.

  • Since all nodes should be writing the same mappings & templates, I don't see any issue with them overriding one another for the mapping and template updates.
    • Like @mshustov, I believe it may make sense to include this in a Fleet package that can be handled by the upgrade mechanism being worked on over there. I don't believe they have a solution for reindexing old data, though.
  • Migrating old data should be achievable with a scripted reindex. You can use conflicts=proceed so that if multiple nodes are running the reindex at once, conflicts are simply ignored (see the sketch after this list).
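For illustration, a minimal sketch of both points using the `@elastic/elasticsearch` client (v8-style API). The index names, template, and painless script are placeholders, not the actual RAC resources:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function upgradeAssets() {
  // PUT of an index template is idempotent: every node writes the same
  // definition, so concurrent updates converge on the same state.
  await client.indices.putIndexTemplate({
    name: 'my-alerts-template', // placeholder name
    index_patterns: ['.alerts-example-*'], // placeholder pattern
    template: {
      mappings: { properties: { '@timestamp': { type: 'date' } } },
    },
  });

  // Scripted reindex of old data. With conflicts: 'proceed', documents
  // already migrated by a concurrent node produce version conflicts that
  // are counted and skipped instead of failing the whole operation.
  await client.reindex({
    conflicts: 'proceed',
    source: { index: '.alerts-example-old' }, // placeholder indices
    dest: { index: '.alerts-example-new' },
    script: { lang: 'painless', source: 'ctx._source.schema_version = 2' },
  });
}
```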

Also, I'm not sure what's in these specific indices. Is this append-only immutable data, or are these stateful mutable documents? If it's the former, reindexing like this should be pretty safe and straightforward; otherwise, some more thought will need to be put into reindexing this data.

@kobelb
Contributor

kobelb commented Jul 9, 2021

Building on what @joshdover articulated, ideally we'd be able to run these migration scripts "exactly once". However, this is a hard problem to solve when working with a distributed system, and we have a distributed system here because Kibana controls the API calls that need to be made against Elasticsearch.

One of the common tricks to getting "exactly once" semantics is to couple "at least once" with idempotent operations. This is conceptually what @joshdover is recommending above. In this situation, we want Kibana to perform the migrations "at least once" but we need idempotent operations in Elasticsearch to ensure that even though these API calls might be made multiple times, they cause Elasticsearch to be in the same state as if they were only made once.

Kibana can lazily achieve "at least once" by executing the code on literally every start-up, and this is possible right now. However, we can consider adding some optimizations to Kibana to make this more efficient: once we have a successful completion, no longer execute this code (see the sketch below). This is really just a performance optimization though, as we'll still need to anticipate multiple Kibana instances running the migration code in parallel and multiple times consecutively.
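As an illustration of that optimization, a hedged sketch using the `@elastic/elasticsearch` client: a completion marker is written with a create (op_type=create), so only one concurrent writer can succeed and later boots can skip the work. The `.kibana-upgrade-state` index, the document id scheme, and the `runMigrationOnce` helper are all invented for this example:

```ts
import { Client, errors } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch of the "performance optimization" described above. The marker
// index and id are invented for illustration.
async function runMigrationOnce(version: string, migrate: () => Promise<void>) {
  const marker = { index: '.kibana-upgrade-state', id: `my-plugin-${version}` };

  // If a marker exists, another node already completed this migration.
  if (await client.exists(marker)) return;

  // The migration itself must still be idempotent: several nodes may
  // reach this point concurrently before any marker is written.
  await migrate();

  try {
    await client.create({
      ...marker,
      document: { completedAt: new Date().toISOString() },
    });
  } catch (e) {
    // A 409 means a concurrent node wrote the marker first; that's fine.
    if (!(e instanceof errors.ResponseError && e.statusCode === 409)) throw e;
  }
}
```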

@pgayvallet
Contributor

Even though the problem is still present, the constraints around its resolution have drastically changed (we do need to support rolling upgrades with serverless now), and AFAIK such needs must be addressed on a case-by-case basis (and we do have a few issues open for specific needs).

I'll go ahead and close this; feel free to reopen with the updated requirements if necessary.

@pgayvallet pgayvallet closed this as not planned Jul 5, 2024