
Epic: Separately tagged logs for VM processes, dmesg, and runner #578

Open · 2 tasks
sharnoff opened this issue Oct 21, 2023 · 11 comments

sharnoff (Member) commented Oct 21, 2023

Motivation

  1. For each VM, there are many programs generating logs that we care about, but because they all get aggregated into a single stream, it can be hard to filter for just the ones you care about (or even to attribute a particular log line to the component that produced it). The full list of components is:
    • neonvm-runner
    • QEMU (?): we don't currently have logs for QEMU itself, but it would be great to have them
    • VM kernel logs (dmesg)
    • vector (running inside the VM to provide metrics)
    • postgres_exporter
    • pgbouncer
    • postgres
    • compute_ctl
    • vm-monitor
    • chrony
  2. Logs from the VM kernel can cut into the middle of other log lines, which hurts our ability to search existing logs

These combine to significantly impair the UX of our observability for VMs.

DoD

  1. All logs from within the VM are easily attributable to a particular component
  2. Filtering for logs from a particular component is trivial, and works within the bounds of our existing systems (e.g. by using log labeling)
  3. Logs from the VM kernel cannot interfere with other logs
  4. Logs from a VM can still be viewed with kubectl logs during local development

Implementation ideas

TODO (various ideas, need to discuss)

Tasks

Other related tasks, Epics, and links

sharnoff added the t/Epic (Issue type: Epic) and c/autoscaling/neonvm (Component: autoscaling: NeonVM) labels on Oct 21, 2023
sharnoff mentioned this issue on Oct 21, 2023
lassizci (Contributor) commented
Omrigan self-assigned this on Dec 25, 2023
Omrigan (Contributor) commented Dec 25, 2023

Can we utilize vector, which we already have inside the VM, to push logs directly to loki, which we also already have?

lassizci (Contributor) commented
It's not a good practice from a security perspective to have credentials in the virtual machines. Also, if we think about reconfigurability, it's best to have as few expectations about observability built in as possible, so the pipeline can evolve independently without needing reconfiguration at the compute-instance level.

Omrigan (Contributor) commented Dec 25, 2023

> It's not a good practice from a security perspective to have credentials in the virtual machines.

But these would be write-only credentials. In that case, the only risk is a DoS from too many logs, which we can combat on the receiver end.

Another option is to have a separate instance of vector outside the VM in the pod, configured to pull data from the in-VM instance [1].

> Also, if we think about reconfigurability, it's best to have as few expectations about observability built in as possible, so the pipeline can evolve independently without needing reconfiguration at the compute-instance level.

What do you mean? Are you talking about updating credentials?

Or, in general, dependence on a particular observability agent? Such dependence, I believe, we cannot escape.

1: https://vector.dev/docs/reference/configuration/sources/vector/

lassizci (Contributor) commented Dec 25, 2023

> > It's not a good practice from a security perspective to have credentials in the virtual machines.
>
> But these would be write-only credentials. In that case, the only risk is a DoS from too many logs, which we can combat on the receiver end.

If we skip a collector we control, we cannot deal with a DoS at the receiving end. An escape from PostgreSQL would also potentially give control over labeling, etc.

We also do processing between collecting and sending the logs (relabeling, perhaps deriving metrics from logs, switching between plaintext and JSON, and so on…). Queueing of outgoing logs should also not happen inside the computes, but in a trusted environment.

Let's say our log storage is offline and the compute suspends. That would mean either losing the logs or keeping the compute online for retries.

> Another option is to have a separate instance of vector outside the VM in the pod, configured to pull data from the in-VM instance [1].

I think what makes the most sense is to write logs to a socket provided by the host; then we can treat the rest of the pipeline as an implementation detail (see the sketch after this comment).

> > Also, if we think about reconfigurability, it's best to have as few expectations about observability built in as possible, so the pipeline can evolve independently without needing reconfiguration at the compute-instance level.
>
> What do you mean? Are you talking about updating credentials?

Updating/rotating the credentials is one thing; others are building metrics from the logs, relabeling, adding labels, and swapping the log collector for something else.

> Or, in general, dependence on a particular observability agent? Such dependence, I believe, we cannot escape.

We can switch the observability agent rather easily when it runs outside of the virtual machines. That's currently possible, and I don't think it makes much sense to make it harder, nor to waste customers' CPU time and memory running such things.
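
For illustration, here is a minimal sketch of the guest side of the "write logs to a socket provided by the host" idea, assuming the host exposes a virtio-serial port at /dev/virtio-ports/logs and that a plain `<component> <line>` framing is acceptable; both are placeholders, not an agreed interface.

```go
// Guest-side sketch: forward one component's log stream to a host-provided
// channel. The device path and the "<component> <line>" framing are
// assumptions for illustration, not an agreed interface.
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// forward copies a component's log stream to the host-provided channel,
// prefixing every line with the component name so the host side can attach
// labels without parsing the payload itself.
func forward(component string, in io.Reader, out io.Writer) error {
	scanner := bufio.NewScanner(in)
	w := bufio.NewWriter(out)
	defer w.Flush()
	for scanner.Scan() {
		if _, err := fmt.Fprintf(w, "%s %s\n", component, scanner.Text()); err != nil {
			return err
		}
	}
	return scanner.Err()
}

func main() {
	// Hypothetical: the host exposes a virtio-serial port for logs.
	out, err := os.OpenFile("/dev/virtio-ports/logs", os.O_WRONLY, 0)
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Forward whatever is piped into this process (e.g. postgres output).
	if err := forward("postgres", os.Stdin, out); err != nil {
		panic(err)
	}
}
```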

sharnoff (Member, Author) commented
From discussing with @Omrigan earlier: one simplification we can make is to just get logs from the VM to stdout in neonvm-runner (the container running the VM). We already have log collection in k8s, so we can piggy-back on that, which is easier than trying to push the logs somewhere else.
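
A rough sketch of what that could look like on the neonvm-runner side, assuming the guest's tagged stream is bridged to a unix socket on the host (the socket path and line format below are hypothetical and must match whatever the guest writes). Each line is re-emitted as JSON on stdout, so the existing k8s log collection picks it up and filtering by the component field becomes trivial.

```go
// Host-side sketch for neonvm-runner: read the tagged stream from the VM and
// re-emit each line as JSON on stdout. The socket path and line format are
// hypothetical and must match whatever the guest side writes.
package main

import (
	"bufio"
	"encoding/json"
	"net"
	"os"
	"strings"
	"time"
)

type logLine struct {
	Timestamp time.Time `json:"ts"`
	Component string    `json:"component"` // e.g. "postgres", "dmesg", "vm-monitor"
	Message   string    `json:"msg"`
}

func main() {
	// Hypothetical unix socket that the VM's log channel is bridged to.
	conn, err := net.Dial("unix", "/run/neonvm/vm-logs.sock")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	enc := json.NewEncoder(os.Stdout)
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		// Assumed format: "<component> <raw log line>".
		component, msg, found := strings.Cut(scanner.Text(), " ")
		if !found {
			component, msg = "unknown", scanner.Text()
		}
		_ = enc.Encode(logLine{Timestamp: time.Now(), Component: component, Message: msg})
	}
}
```

With something like this, kubectl logs on the runner pod still shows everything during local development (DoD item 4), just with a component field to filter on.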

sharnoff (Member, Author) commented
Notes from discussion:

  • We should postpone this until we switch to systemd in the VMs, because of the extra tooling that systemd gives us
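
To make the "extra tooling" concrete: with systemd in the VM, journald already tags every entry with the unit that produced it, so a forwarder only needs to follow the journal and pass the unit name along as the component label. This is a sketch, not a design decision.

```go
// Sketch of what journald buys us: every entry already carries the unit that
// produced it (_SYSTEMD_UNIT), and kernel messages are marked with
// _TRANSPORT=kernel, so per-component tagging comes almost for free.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os/exec"
)

func main() {
	cmd := exec.Command("journalctl", "--follow", "--output=json")
	out, err := cmd.StdoutPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	scanner := bufio.NewScanner(out)
	for scanner.Scan() {
		var entry map[string]any
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			continue // skip lines we can't parse
		}
		unit, _ := entry["_SYSTEMD_UNIT"].(string)
		msg, _ := entry["MESSAGE"].(string)
		fmt.Printf("%s: %s\n", unit, msg)
	}
	_ = cmd.Wait()
}
```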

Omrigan (Contributor) commented Feb 13, 2024

We have an occurrence of non-postgres log spam (in this case, from the oom-killer), which won't be fixed by https://github.com/neondatabase/cloud/issues/8602

https://neondb.slack.com/archives/C03F5SM1N02/p1707489906661529

sharnoff (Member, Author) commented
An occurrence of log interleaving that could potentially be fixed by this, depending on how we implement it: https://neondb.slack.com/archives/C03TN5G758R/p1714057349130309

knz commented Oct 17, 2024

xref https://github.com/neondatabase/cloud/issues/18244
We have a customer ask to export the postgres logs to an external service, so they can inspect their own logs themselves (e.g. via Datadog).

We haven't fully specced that out yet, but the assumption so far is that we would reuse the OpenTelemetry collector we already deploy for metrics, and route the logs through it.

knz commented Oct 17, 2024

Regarding pushing logs to the console / k8s logs: the volume will be too large in some cases, e.g. if the user cares about pg_audit logs, and this will become a bottleneck. It also won't solve the labeling problem, which matters for the product: customers only want their postgres logs, not our own control logs. Better to export directly over the network (see the point below).

Regarding push/pull and credentials: one option is to have a service running inside the VM that accepts incoming connections, and delivers the logs from the VM through that. Would that solve the problem?
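
A toy sketch of that pull model, for discussion only: a small endpoint inside the VM that an external collector connects to, so no push credentials need to live in the guest. The port, URL path, and single-file log source are placeholders.

```go
// Toy sketch of a pull-based log endpoint inside the VM. The external
// collector connects in, so no push credentials live in the guest. The
// port and log path are placeholders; a real version would tail the file
// and authenticate the caller.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	http.HandleFunc("/logs/postgres", func(w http.ResponseWriter, r *http.Request) {
		f, err := os.Open("/var/log/postgres.log") // placeholder path
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer f.Close()
		w.Header().Set("Content-Type", "text/plain")
		_, _ = io.Copy(w, f) // stream current contents to the collector
	})
	if err := http.ListenAndServe(":9400", nil); err != nil {
		panic(err)
	}
}
```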
