Reliability: How to Recover from System Downtime? #69

clean99 · 2023-07-05T10:49:11Z


As far as I can tell, the data is currently stored in memory (correct me if I'm wrong), which means we lose all of it after a system downtime.

For most developers, the most important metrics are the ones recorded when the system has issues (which potentially cause a fault or error). If all the data is lost once the system shuts down, we lose a big chance to help users improve their systems.

So I wonder if we can store the data persistently, which would not only help with recovery but also reduce memory load (memory is more expensive than disk).

Answered by gagbo


gagbo · 2023-07-05T11:28:24Z

Maintainer

The metrics are exposed in the Prometheus format. That means that the data is stored in the Prometheus instance that polls the system, not in the system being monitored.

If the system fails catastrophically anyway, there's no way to guarantee that a "graceful shutdown" would happen and save some logs or metrics to persistent storage. That's why you usually want to store the data in a system different from the one you want to monitor (here, that other system is the Prometheus instance), and why information around system downtime is always "best effort".

And at runtime, the system being monitored only stores the current values of the metrics, not the time series, so the memory footprint is as low as possible.
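To illustrate the point about memory footprint, here is a minimal sketch in Python of a process that keeps only the current value of a counter and renders it in the Prometheus text exposition format on demand. The `Counter` class and the metric name `demo_function_calls_total` are hypothetical and written only for this example; this is not how Autometrics itself is implemented.

```python
import threading


class Counter:
    """Holds only the current value of one metric; no history is kept."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        # Overwrites the single stored number; past values are not retained.
        with self._lock:
            self._value += amount

    def render(self) -> str:
        """Render the current value in the Prometheus text exposition format."""
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )


# Hypothetical metric name, chosen for illustration only.
calls = Counter("demo_function_calls_total", "Number of calls to a demo function.")
for _ in range(3):
    calls.inc()

# This text is what a scraper would receive; the history lives on the scraper side.
exposition = calls.render()
print(exposition)
```

However many times `inc` is called, the process holds a single number per metric. The time series only comes into existence on the Prometheus side, as it stores each scraped value together with the scrape timestamp.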

1 reply

emschwartz Jul 10, 2023

Just to add to this, we are indeed relying on Prometheus to store the metrics and also to keep track of whether the system goes down completely. Prometheus actually has a built-in metric called `up` that indicates whether the service was reachable when it tried to scrape the metrics.
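As a rough illustration of why downtime stays visible even when the monitored process loses everything, the sketch below simulates a scraper in Python. It is a simplified model and not Prometheus code: `scrape_once`, `healthy_target`, `crashed_target`, and the in-memory `store` list are all hypothetical, and the parser ignores labels and optional timestamps. The only behavior taken from Prometheus is that `up` is recorded as 1 when a scrape succeeds and 0 when it fails.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, str, float]  # (timestamp, metric name, value)


def scrape_once(fetch: Callable[[], str], store: List[Sample], now: float) -> None:
    """Record one scrape: `up` is 1 if the target answered, 0 otherwise."""
    try:
        body = fetch()
    except Exception:
        # The target is unreachable, but the scraper still records that fact.
        store.append((now, "up", 0.0))
        return
    store.append((now, "up", 1.0))
    for line in body.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        store.append((now, name, float(value)))


def healthy_target() -> str:
    # Stand-in for an HTTP GET of the target's metrics endpoint.
    return "# TYPE demo_calls_total counter\ndemo_calls_total 3\n"


def crashed_target() -> str:
    # Stand-in for a target that has gone down completely.
    raise ConnectionError("target unreachable")


store: List[Sample] = []
scrape_once(healthy_target, store, now=0.0)
scrape_once(crashed_target, store, now=15.0)

up_series = [(ts, value) for ts, name, value in store if name == "up"]
print(up_series)
```

The crashed target never gets a chance to save anything, yet the scraper's own store contains both the last successfully scraped values and the moment the target stopped answering.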
