[Heartbeat] Monitor Retries #36147

Merged
merged 79 commits into from
Aug 31, 2023

Conversation

andrewvc
Contributor

After some experimentation, this gets retries working using code at the monitor level. It also refactors the summary and state wrappers into code that is much easier to understand. This is an experiment only: the code is still pretty ugly, and we should reorganize parameters and objects into something neater than what's here.

The main question this seeks to answer is where we should put this logic.

Sample output of:

x-pack/heartbeat$  mage build && ELASTIC_SYNTHETICS_CAPABLE=true ./heartbeat -e | jq -c  '{"monitor": {"name": .monitor.name, "status": .monitor.status}, "summary": .summary, "state": .state}'
{"monitor":{"name":"Test","status":"up"},"summary":null,"state":null}
{"monitor":{"name":"Test","status":"up"},"summary":null,"state":null}
{"monitor":{"name":"Test","status":"up"},"summary":null,"state":null}
{"monitor":{"name":"Test","status":"up"},"summary":{"status":"up","retry_group":"R-netflix-1898aee1e84","attempt":1,"max_attempts":2,"final_attempt":false,"up":5,"down":0},"state":{"duration_ms":"0","up":1,"id":"default-1898aee1ec0-0","started_at":"2023-07-24T21:43:28.576999-05:00","status":"up","checks":1,"down":0,"flap_history":[],"ends":null}}
{"monitor":{"name":"Test","status":"up"},"summary":null,"state":null}
{"monitor":{"name":"Test","status":"up"},"summary":null,"state":null}
{"monitor":{"name":"Test","status":"up"},"summary":null,"state":null}
{"monitor":{"name":"Test","status":"up"},"summary":{"up":5,"down":0,"status":"up","retry_group":"R-netflix-1898aee1e84","attempt":2,"max_attempts":2,"final_attempt":true},"state":{"id":"default-1898aee1ec0-0","started_at":"2023-07-24T21:43:28.576999-05:00","duration_ms":"23","checks":2,"down":0,"ends":null,"status":"up","up":2,"flap_history":[]}}
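
For readers consuming this output, here is a minimal Go sketch that decodes the `summary` object from one line. The JSON keys are taken from the sample above; the `Summary` and `Line` types and the `parseSummary` helper are hypothetical names for illustration, not Heartbeat's own types, and the second input line is a trimmed copy of the sample with `state` set to null.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Summary mirrors the "summary" object in the sample output.
// The Go type and field names are illustrative assumptions.
type Summary struct {
	Status       string `json:"status"`
	RetryGroup   string `json:"retry_group"`
	Attempt      int    `json:"attempt"`
	MaxAttempts  int    `json:"max_attempts"`
	FinalAttempt bool   `json:"final_attempt"`
	Up           int    `json:"up"`
	Down         int    `json:"down"`
}

// Line is the subset of one output line that this sketch reads.
type Line struct {
	Summary *Summary `json:"summary"` // stays nil when the JSON value is null
}

// parseSummary returns the summary carried by one output line,
// or nil if the line has no summary.
func parseSummary(raw string) (*Summary, error) {
	var l Line
	if err := json.Unmarshal([]byte(raw), &l); err != nil {
		return nil, err
	}
	return l.Summary, nil
}

func main() {
	lines := []string{
		`{"monitor":{"name":"Test","status":"up"},"summary":null,"state":null}`,
		`{"monitor":{"name":"Test","status":"up"},"summary":{"status":"up","retry_group":"R-netflix-1898aee1e84","attempt":2,"max_attempts":2,"final_attempt":true,"up":5,"down":0},"state":null}`,
	}
	for _, raw := range lines {
		s, err := parseSummary(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		if s == nil {
			fmt.Println("no summary")
			continue
		}
		fmt.Printf("attempt %d/%d final=%t\n", s.Attempt, s.MaxAttempts, s.FinalAttempt)
	}
}
```

Using a pointer for the `summary` field lets the sketch tell "no summary on this event" apart from a summary whose counters happen to be zero.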

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@andrewvc andrewvc added enhancement Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team labels Jul 25, 2023
@andrewvc andrewvc requested a review from a team as a code owner July 25, 2023 02:48
@andrewvc andrewvc self-assigned this Jul 25, 2023
@elasticmachine
Collaborator

Pinging @elastic/uptime (Team:Uptime)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jul 25, 2023
@mergify
Contributor

mergify bot commented Jul 25, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @andrewvc? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Collaborator

elasticmachine commented Jul 25, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-08-31T18:00:47.286+0000

  • Duration: 92 min 39 sec

Test stats 🧪

Test Results
Failed 0
Passed 28157
Skipped 2015
Total 30172

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@andrewvc
Contributor Author

@emilioalvap thanks for the review feedback, I think I've addressed your concerns. Now if I can make the linter happy 🤞 I think the build may go green soon

Collaborator

@emilioalvap emilioalvap left a comment

LGTM. One last concern: the changes to enrich.go that we discussed the other day, about emitting summaries only on cmd/status, don't seem to have been included yet:

return je.createSummary(event)

@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Aug 31, 2023
@@ -17,7 +17,6 @@ import (
type loaderDB struct {
keysToState map[string]*monitorstate.State
mtx *sync.Mutex
Contributor

Not sure this needs to be a pointer.
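
To illustrate the point, here is a minimal sketch of a struct that holds its mutex by value. `stateDB`, `newStateDB`, and `incr` are hypothetical stand-ins and not the actual `loaderDB` code. It relies on the zero value of `sync.Mutex` being an unlocked mutex, and on the struct always being used through a pointer so the mutex is never copied.

```go
package main

import (
	"fmt"
	"sync"
)

// stateDB is a hypothetical stand-in for a struct like loaderDB.
type stateDB struct {
	mtx    sync.Mutex // value field: the zero value is an unlocked mutex
	states map[string]int
}

// newStateDB returns a pointer so callers never copy the embedded mutex.
func newStateDB() *stateDB {
	return &stateDB{states: map[string]int{}}
}

// incr bumps the counter for key under the lock and returns the new value.
func (db *stateDB) incr(key string) int {
	db.mtx.Lock()
	defer db.mtx.Unlock()
	db.states[key]++
	return db.states[key]
}

func main() {
	db := newStateDB()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			db.incr("monitor-1")
		}()
	}
	wg.Wait()
	// All goroutines have finished, so reading without the lock is safe here.
	fmt.Println(db.states["monitor-1"])
}
```

A pointer field is mainly useful when the struct itself is copied by value and the copies must share one lock; with pointer receivers throughout, the value field avoids a separate allocation and a possible nil mutex.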

Contributor

@ycombinator ycombinator left a comment

I'm far from an expert in Heartbeat / Synthetics so I didn't review the logic in this PR. But I made a few general Golang code review comments.

@andrewvc
Contributor Author

@ycombinator thanks for the review, I think I've addressed all of your comments

Contributor

@ycombinator ycombinator left a comment

LGTM.

@andrewvc andrewvc merged commit a6bae85 into elastic:main Aug 31, 2023
9 checks passed
@andrewvc andrewvc deleted the retestsched branch August 31, 2023 21:00
@andrewvc andrewvc mentioned this pull request Sep 7, 2023
3 tasks
@@ -91,7 +91,7 @@ func (s *Summarizer) Wrap(j jobs.Job) jobs.Job {
 return func(event *beat.Event) ([]jobs.Job, error) {
 conts, jobErr := j(event)

-_, _ = event.PutValue("monitor.check_group", s.checkGroup)
+_, _ = event.PutValue("monitor.check_group", fmt.Sprintf("%s-%d", s.checkGroup, s.jobSummary.Attempt))
Member

@andrewvc Sorry for the late note on this PR. Would this break the existing Uptime UI in any way?

Contributor Author

No, the UI treats the UUID as an opaque value, as does ES (it's a keyword type; there's no dedicated UUID type).
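
As a rough illustration of the change discussed in this thread, the sketch below shows how suffixing the attempt number yields a distinct `monitor.check_group` value per attempt. The `attemptCheckGroup` helper and the base value are hypothetical; only the `"%s-%d"` format string comes from the diff.

```go
package main

import "fmt"

// attemptCheckGroup is a hypothetical helper that applies the same
// "%s-%d" format used in the diff to build a per-attempt check group.
func attemptCheckGroup(checkGroup string, attempt int) string {
	return fmt.Sprintf("%s-%d", checkGroup, attempt)
}

func main() {
	base := "f0e1d2c3" // placeholder value, not a real check group
	for attempt := 1; attempt <= 2; attempt++ {
		fmt.Println(attemptCheckGroup(base, attempt))
	}
}
```

Because consumers compare the value only as an opaque string, the suffixed form behaves like any other check group identifier; each attempt simply gets its own.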

Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
Adds retries to Heartbeat monitors. Part of elastic/synthetics#792

This refactors a ton of code around summarizing events, and cleans up a lot of tech debt as well.
Labels
enhancement Heartbeat Team:Automation Label for the Observability productivity team Team:Elastic-Agent Label for the Agent team Team:obs-ds-hosted-services Label for the Observability Hosted Services team
6 participants