
[Fleet] Initiate Fleet setup on boot #111858

Closed
10 tasks done
joshdover opened this issue Sep 10, 2021 · 13 comments
Assignees
Labels
enhancement New value added to drive a business result required-for-8.0 This work is required to be done before 8.0 lands, bc it relates to a breaking change or similar. Team:Fleet Team label for Observability Data Collection Fleet team v8.0.0

Comments

@joshdover
Contributor

joshdover commented Sep 10, 2021

Blocked by:

In order to support smooth Stack upgrades, certain packages, if installed, need to be kept in sync with the Stack version. To accomplish this, the Fleet plugin should initiate its setup process when Kibana starts up rather than waiting for a user to visit the Fleet app in the UI.

Requirements:

  • Add logic to the Fleet plugin's start method to initiate the [setup process] - [Fleet] Move Fleet Setup to start lifecycle #117552
    • This logic should block Kibana startup until the setup process completes
    • We should ensure that we do not make the default on-prem or development boot time slow. To accomplish this we can either:
      • Separate the managed package upgrades from setup
        • During boot we only ensure that any managed packages are upgraded (if previously installed).
        • Preconfiguration and setup would still be a separate process that runs either via the API or when the user loads the Fleet or Integrations apps.
      • No longer install packages by default and run the full setup & preconfiguration process during boot
        • This would require fewer changes and result in a cleaner, clearer setup procedure that is simpler to maintain and debug.
        • Default on-prem configuration would be very fast.
        • We can then remove the setup APIs or make them no-ops. Removing them in 8.0 is preferable since it is a breaking change.
  • Upgrading packages should also upgrade all package policies (only for managed packages) - [Fleet] Update logic for "Keep policies up to date" defaults in 8.0 #119126
    • Add a package spec package_policy_upgrade_strategy field for specifying a package policy upgrade strategy
    • Add support using package_policy_upgrade_strategy to decide when to attempt policy upgrades: [Change Proposal] Add package_policy_upgrade_strategy field to support Fleet upgrade behavior package-spec#244
    • Update managed packages to use new field
    • Update: As of 2021-11-18, it seems like the change proposal to add this field to the package spec is not going to result in exactly the functionality we proposed. Discussions are still ongoing, so we've elected to unblock ourselves here and simply continue working with our "hardcoded package list" concept in Fleet. See new tasks below
    • Expand the list of packages for which Fleet automatically upgrades policies to include our AUTO_UPDATE_PACKAGES as well as the existing DEFAULT_PACKAGES
  • Add usage telemetry on upgrades
  • Add custom status to Kibana API - [Fleet] Wire Fleet setup status to core Kibana status API #120020
    • We should report Fleet's upgrade status to the Core status API using the core.status.set API
  • Verify that packages listed in the auto upgrade list are also downgraded if Kibana is rolled back. - [Fleet] Add tests for rolling back versions of managed packages. #118797
    • This is critical to support Kibana rollbacks
    • From my reading of the code, this should already be the behavior today, but we should add explicit test coverage if we do not already have this.
    • Update: We captured the testing work here: [Fleet] Initiate Fleet setup on boot #111858 (comment) as a manual test process, and the QAS team created a test ticket here: Test ticket for Initiating Fleet setup on boot #120726
    • Update: As of 2021-11-18, we've decided to punt this to the bottom of the list. Complications around writing automated test for this rollback case caused us to reevaluate. Since this task is mainly centered around tests, we're comfortable shipping it after FF for 8.0 if necessary.
  • Remove the /setup and /agents/setup APIs? (This was scrapped.)
    • They could be useful for troubleshooting purposes, but would have same effect as restarting Kibana and we'd prefer to have a single lifecycle that is done prior to the user using any Fleet features.
    • 8.0 is a good time to remove things
    • Would need to ensure that Agent is updated to remove these API calls
    • If we're not blocking Kibana boot, continuing to call this from the UI when the Fleet app is mounted gives us 'retries'
    • We should determine how slow the no-op scenario is, and if it's slow we can either:
      1. Remove the setup calls from the UI + add a retry button if failing; or
      2. Keep the setup state in memory to make the API faster + add a force option to bypass
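The status-reporting requirement above (wiring Fleet's setup status into the core Kibana status API) can be sketched as a mapping from setup state to a service status level. The `FleetSetupState` shape and the helper below are hypothetical stand-ins; Kibana's real `ServiceStatus` types live in core and differ in detail:

```typescript
// Hedged sketch: map a hypothetical Fleet setup state onto the level/summary
// shape that core.status.set expects. These types are minimal stand-ins,
// not Kibana's actual ServiceStatus definitions.
type ServiceStatusLevel = "available" | "degraded" | "unavailable";

interface ServiceStatus {
  level: ServiceStatusLevel;
  summary: string;
}

type FleetSetupState =
  | { status: "in_progress" }
  | { status: "complete" }
  | { status: "failed"; error: string };

function fleetSetupToServiceStatus(state: FleetSetupState): ServiceStatus {
  switch (state.status) {
    case "in_progress":
      // Report degraded (not unavailable) while setup runs, so the rest of
      // Kibana can still come up.
      return { level: "degraded", summary: "Fleet setup is in progress" };
    case "complete":
      return { level: "available", summary: "Fleet is available" };
    case "failed":
      return { level: "unavailable", summary: `Fleet setup failed: ${state.error}` };
  }
}
```

In the real plugin this mapping would be pushed through an observable passed to `core.status.set` so the `/api/status` endpoint reflects setup progress.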

Open questions

@joshdover joshdover added enhancement New value added to drive a business result Team:Fleet Team label for Observability Data Collection Fleet team required-for-8.0 This work is required to be done before 8.0 lands, bc it relates to a breaking change or similar. labels Sep 10, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@kpollich
Member

We need to force these packages to always upgrade their package policies and not let this be configurable by the user. @kpollich do we have a mechanism for doing this already?

This is captured as part of our top-level package policy upgrade issue under the final "Automatic package upgrades" bullet point.

#106048

The plan, for now, is to add a flag to integrations that denotes whether associated package policies should automatically be upgraded when the package is updated. This should eventually be replaced with a value that comes from the actual package spec instead, so that packages like APM can instruct Fleet to automatically upgrade policies instead of relying on user configuration.

We can introduce a piece of preconfiguration for these existing "auto-update" packages that includes this flag, as well. We set these packages up here:

/*
Package rules:
| | unremovablePackages | defaultPackages | autoUpdatePackages |
|---------------|:---------------------:|:---------------:|:------------------:|
| Removable | ❌ | ✔️ | ✔️ |
| Auto-installs | ❌ | ✔️ | ❌ |
| Auto-updates | ❌ | ✔️ | ✔️ |
`endpoint` is a special package. It needs to autoupdate, it needs to _not_ be
removable, but it doesn't install by default. Following the table, it needs to
be in `unremovablePackages` and in `autoUpdatePackages`, but not in
`defaultPackages`.
*/
export const unremovablePackages = [
  FLEET_SYSTEM_PACKAGE,
  FLEET_ELASTIC_AGENT_PACKAGE,
  FLEET_SERVER_PACKAGE,
  FLEET_ENDPOINT_PACKAGE,
];

export const defaultPackages = unremovablePackages.filter((p) => p !== FLEET_ENDPOINT_PACKAGE);

export const autoUpdatePackages = [FLEET_ENDPOINT_PACKAGE];

and install them as part of our preconfiguration process here:

packages = [
  ...packages,
  ...DEFAULT_PACKAGES.filter((pkg) => !preconfiguredPackageNames.has(pkg.name)),
  ...autoUpdateablePackages.filter((pkg) => !preconfiguredPackageNames.has(pkg.name)),
];

const { nonFatalErrors } = await ensurePreconfiguredPackagesAndPolicies(
  soClient,
  esClient,
  policies,
  packages,
  defaultOutput
);

export async function ensurePreconfiguredPackagesAndPolicies(

So, we could set a flag on some or all of these specific preconfigured packages, and if necessary another one to indicate that this piece of configuration is "frozen" and uneditable by the user. When the setup process saves these packages with these flags set, all should function as expected once the implementation specified in the above top-level issue is completed.
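The flag-merging idea above could look something like the following sketch. The `keepPoliciesUpToDate` field name and the `mergeManagedPackages` helper are hypothetical stand-ins for illustration, not Fleet's shipped API:

```typescript
// Hypothetical sketch: merge default and auto-update managed packages into
// the operator's preconfigured package list, marking the managed entries
// with a flag so their package policies are upgraded automatically.
// `keepPoliciesUpToDate` is an assumed field name, not Fleet's real one.
interface PreconfiguredPackage {
  name: string;
  keepPoliciesUpToDate?: boolean;
}

function mergeManagedPackages(
  preconfigured: PreconfiguredPackage[],
  defaults: string[],
  autoUpdate: string[]
): PreconfiguredPackage[] {
  const seen = new Set(preconfigured.map((p) => p.name));
  const merged = [...preconfigured];
  for (const name of [...defaults, ...autoUpdate]) {
    if (seen.has(name)) continue; // operator-supplied config wins
    seen.add(name);
    merged.push({ name, keepPoliciesUpToDate: true });
  }
  return merged;
}
```

The key property is that operator-supplied preconfiguration is never overwritten; the managed flag is only stamped onto entries Fleet adds itself.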

@joshdover
Contributor Author

joshdover commented Nov 10, 2021

One thing that has come up as part of moving the Fleet setup call to start on Kibana boot is the issue of multiple nodes running the setup concurrently. Today we have a naive guard that prevents this happening on a single node, but nothing that prevents it from happening concurrently on multiple nodes. My thinking is that by moving this to Kibana boot, it’s more likely that the multi-node scenario could happen during upgrades.

Questions:

  • Which pieces of Fleet setup are not idempotent?
  • Which should be safe?
    • Installing all Elasticsearch assets should not cause an issue (index templates, ingest pipelines, transforms, etc.)
  • How can we make sure that switching to deterministic IDs does not create problems for clusters that were setup prior to 8.0 where we weren't using deterministic IDs?
    • I think we can continue to use the same logic for checking if an existing object already exists

Given the above, I think this should be safe to run on all nodes if we can make the agent policy and package policy IDs deterministic and ensure that the create calls use overwrite: true to avoid conflict errors.

For Agent policies, we do require that preconfigured policies supply a name here. Would it be safe to use this name to seed a uuidv5 for a deterministic ID? If we did end up creating duplicates, are there any really bad side-effects?

  • It's worth noting that we don't require that an id is supplied if the policy is the default policy or default fleet server policy. Maybe instead of name we should use id and fall back to default_policy or default_fleet_server_policy in cases where there is no ID?

To make package policies deterministic, we can probably piggyback on the agent policy deterministic ID logic and simply append the name parameter to it. This should work because package policies must belong to a single agent policy AND because we enforce globally unique package policy names as of #115212
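The deterministic-ID scheme described above can be sketched with a name-based (v5) UUID. This is a minimal RFC 4122 uuidv5 implementation for illustration; the `FLEET_NAMESPACE` constant and the `agentPolicyId`/`packagePolicyId` helpers are hypothetical, not Fleet's actual code:

```typescript
import { createHash } from "crypto";

// Minimal RFC 4122 name-based (v5, SHA-1) UUID: hash the namespace UUID's
// bytes followed by the name, then stamp the version and variant bits.
function uuidv5(name: string, namespace: string): string {
  const nsBytes = Buffer.from(namespace.replace(/-/g, ""), "hex");
  const hash = createHash("sha1").update(nsBytes).update(name, "utf8").digest();
  const bytes = hash.subarray(0, 16);
  bytes[6] = (bytes[6] & 0x0f) | 0x50; // version 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // RFC 4122 variant
  const hex = bytes.toString("hex");
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
}

// Hypothetical namespace; any fixed UUID works as long as it never changes.
const FLEET_NAMESPACE = "f8742a34-1f0e-4bcb-9f8a-2f3c54c3f0b7";

// Deterministic agent policy ID: prefer an explicit preconfigured id,
// otherwise seed from the (required, unique) policy name.
function agentPolicyId(policy: { id?: string; name: string }): string {
  return policy.id ?? uuidv5(policy.name, FLEET_NAMESPACE);
}

// Package policy IDs piggyback on the agent policy ID plus the globally
// unique package policy name.
function packagePolicyId(agentPolicy: { id?: string; name: string }, name: string): string {
  return uuidv5(`${agentPolicyId(agentPolicy)}:${name}`, FLEET_NAMESPACE);
}
```

Because uuidv5 is a pure function of its inputs, every Kibana node computes the same IDs independently, which is what makes concurrent multi-node setup safe when combined with `overwrite: true` on the create calls.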

@kpollich
Member

Thanks, @joshdover, for the thorough explanation of our idempotency issues around setup. It seems to me that you've captured every concern I might've had and provided a path forward.

For Agent policies, we do require that preconfigured policies supply a name here. Would it be safe to use this name to seed a uuidv5 for a deterministic ID? If we did end up creating duplicates, are there any really bad side-effects?

I don't think there are any negative side effects in the case that we create two identical agent policies from preconfiguration. Outside of general confusion for the user, I don't think this would cause any breakdowns in Fleet's functionality.

It's worth noting that we don't require that an id is supplied if the policy is the default policy or default fleet server policy. Maybe instead of name we should use id and fallback to default_policy or default_fleet_server_policy in cases where there is no ID?

It does sound safest to me if we fall back to default_policy in cases like this.

@nchaulet
Member

Outputs uses uuidv5

Outputs can use uuidv5 if the user provides an id, but we do not provide an id for the default output (it should not be hard to change that).

I don't think there are any negative side effects in the case that we create two identical agent policies from preconfiguration. Outside of general confusion for the user, I don't think this would cause any breakdowns in Fleet's functionality.

I think it could be an issue, as agents could enroll based on is_default_fleet_server or is_default

For Agent policies, we do require that preconfigured policies supply a name here. Would it be safe to use this name to seed a uuidv5 for a deterministic ID? If we did end up creating duplicates, are there any really bad side-effects

Yes, I think it makes sense to seed a uuidv5 with the name, since we enforce names to be unique. For the cloud preconfigured policy they provide an id; should we instead make the id field mandatory in the preconfiguration? (That's what I did for preconfigured outputs.)

@joshdover
Contributor Author

@kpollich @nchaulet thanks for the feedback. I've summarized the discussion in this comment on the dedicated issue that Kyle created: #118423 (comment)

@kpollich
Member

kpollich commented Dec 1, 2021

Originally contained a WIP version of test instructions. See finalized version below

@joshdover
Contributor Author

Am I understanding the requirements correctly here and following the expected procedure for rolling back Kibana after an upgrade? I looked around and there's no official way to downgrade Kibana as far as I can tell, so I assume what we were looking for here was a rollback to a previous snapshot, and then confirmation that package versions are reset after that rollback. Does that sound correct?

Here are the full docs, sorry for not surfacing these sooner: https://www.elastic.co/guide/en/kibana/7.16/upgrade-migrations.html#upgrade-migrations-rolling-back

I think the key steps are:

  • Delete all saved object indices with DELETE /.kibana*
  • Restore the kibana feature state or all `.kibana*` indices and their aliases from the snapshot

So we don't want to restore the whole snapshot, just the `.kibana*` indices. It also makes a note about shutting down the Kibana nodes first, so you may have to scale Kibana to 0 on Cloud, then restore the snapshot manually via the ES REST API.

@kpollich
Member

kpollich commented Dec 2, 2021

It also makes a note about shutting down the Kibana nodes first, so you may have to scale Kibana to 0 on Cloud, then restore the snapshot manually via the ES REST API.

This definitely makes sense, but following these instructions requires step number 5, which I don't think is possible in Cloud:

Start up all Kibana instances on the older version you wish to rollback to.

I followed the other steps (deleting the .kibana* indices, restoring via the ES API) successfully, but I'm not able to terminate Kibana and restart an earlier version from the Cloud console, it seems.

@kpollich
Member

kpollich commented Dec 6, 2021

@EricDavisX - sharing the latest version of the manual testing steps for testing our managed packages as they relate to Kibana's upgrade/downgrade process. I just ran through these myself in a local dev environment, ping-ponging between 7.15 and 7.16 instances. Hopefully you are the right person to tag for developing a more robust test plan here with our QAS folks. Let me know if I can clarify anything here!


Managed Packages and Kibana Downgrades - Manual Test Instructions

Verify that packages listed in the auto upgrade list are also downgraded if Kibana is rolled back.

We investigated creating some automated tests to cover this case in #118797, but encountered some difficulties that prevented us from making much progress. We're instead opting to document manual testing procedures for Kibana downgrades and how they interact with our various "managed" integrations in Fleet.

Manual Testing Procedure

Goal: Ensure that managed packages that we consider as "default" or "auto update" packages are downgraded when Kibana is rolled back.

These steps are written to be performed on self-hosted Kibana, as downgrading Kibana in cloud is not currently supported.

List of packages under test:

  • Default Packages
    • System
    • Elastic Agent
    • Fleet Server
  • Additional Packages
    • APM
    • Endpoint
    • Synthetics

Start up a 7.15 environment

Start up a fresh instance of Elasticsearch on a 7.15 snapshot as well as a fresh Kibana 7.15 instance. Ensure Fleet setup is completed by visiting the Fleet application in Kibana and waiting for the loading indicator to disappear and for the Fleet UI to appear.

Ensure Kibana is set up for on-disk backups

Add a `path.repo` value to your elasticsearch.yml file or via command-line arguments to ensure Elasticsearch is configured to store on-disk snapshots, e.g.

path:
  repo: /tmp/es-backups

# Or from the CLI
$ bin/elasticsearch -E path.repo=/tmp/es-backups

Install additional managed packages

Install the following non-default managed packages. We don't need to create package policies here, so navigating to the integration's Settings tab and clicking Install [integration] assets should suffice.

  • APM
  • Endpoint
  • Synthetics

Confirm integration versions for managed packages

Confirm the versions of all managed packages. We'll reference these versions later when we upgrade and then eventually downgrade again.

| Integration | Version |
| --- | --- |
| APM | 0.4.0 |
| Elastic Agent | 1.2.1 |
| Endpoint | 1.1.1 |
| Fleet Server | 1.0.1 |
| Synthetics | 0.3.0 |
| System | 1.6.3 |

Snapshot your Kibana data

Register a repository

Via the Stack Management -> Data -> Snapshot and Restore UI, register a repository using the "Shared file system" option. Give it a name e.g. my-repository and provide the path you configured above: /tmp/es-backups.

Follow the docs to create a snapshot of your instance via Kibana dev tools, e.g.

PUT /_snapshot/my-repository/[timestamp]_snapshot?wait_for_completion=true

Upgrade to a 7.16 environment

Stop your 7.15 environment, and run a 7.16 environment in its place. Ensure you provide the same configuration values for the path.repo field as above. Start up a 7.16 instance of Elasticsearch and Kibana.

Run Fleet setup again

Navigate to /app/fleet and make sure the Fleet setup process has run successfully again.

Confirm integration versions for managed packages

Your integrations should have a few new versions in 7.16. These are called out in bold.

| Integration | Version |
| --- | --- |
| APM | 0.4.0 |
| Elastic Agent | **1.3.0** |
| Endpoint | **1.2.2** |
| Fleet Server | **1.1.0** |
| Synthetics | **0.5.0** |
| System | 1.6.3 |

Elastic Agent, Endpoint, Fleet Server, and Synthetics should all have upgraded to new versions, while APM and System should remain on their existing version through the upgrade process.

Roll back to 7.15

Follow the rollback documentation to roll back to 7.15. The key steps of this process should be:

  1. Stop your 7.16 Kibana instance

  2. Make a DELETE /.kibana* API request to your Elasticsearch instance

  3. Restore your previous 7.15 data using the Elasticsearch API, e.g.

    # You may need to do this first to allow wildcard deletes
    PUT _cluster/settings
    {
      "persistent" : {
        "action.destructive_requires_name" : false
      }
    }
    
    # Restore command
    POST _snapshot/my-repository/[timestamp]_snapshot/_restore
    {
      "indices": ".kibana*"
    }
    
  4. Start a 7.15 Kibana instance

Confirm integration versions for managed packages

Your packages should have downgraded to the versions they were prior to upgrading to 7.16. Kibana should not have maintained the newer versions of any packages.

| Integration | Version |
| --- | --- |
| APM | 0.4.0 |
| Elastic Agent | 1.2.1 |
| Endpoint | 1.1.1 |
| Fleet Server | 1.0.1 |
| Synthetics | 0.3.0 |
| System | 1.6.3 |

@EricDavisX
Contributor

@kpollich thanks, you can reach out to me, sure. I am going to pass this on to @dikshachauhan-qasource and @sagarnagpal-qasource to review and submit a broader testing assessment of what combinations are possible, based on your comment just above, #111858 (comment)

@joshdover
Contributor Author

Going to close this issue as the implementation work here is done. If needed, please open a new issue for testing or continue discussing right here.

@dikshachauhan-qasource

Hi @EricDavisX

We have attempted to validate it per the steps mentioned and shared our observations on the related testing ticket: #120726

Thanks
QAS
