
Epic: Separately tagged logs for VM processes, dmesg, and runner #578

Open · 2 tasks
sharnoff opened this issue Oct 21, 2023 · 11 comments

sharnoff (Member) commented Oct 21, 2023

Motivation

  1. For each VM, there are many programs generating logs that we care about, but because they all get aggregated into a single stream, it can be hard to filter for just the ones you care about (or even to attribute a particular log line to the component that produced it). The full list of components is:
    • neonvm-runner
    • QEMU (?): we don't currently have logs for QEMU itself, but it would be great to have them
    • VM kernel logs (dmesg)
    • vector (running inside the VM to provide metrics)
    • postgres_exporter
    • pgbouncer
    • postgres
    • compute_ctl
    • vm-monitor
    • chrony
  2. Logs from the VM kernel can cut into the middle of other log lines, which hurts our ability to search existing logs

These combine to significantly impair the UX of our observability for VMs.

DoD

  1. All logs from within the VM are easily attributable to a particular component
  2. Filtering for logs from a particular component is trivial, and works within the bounds of our existing systems (e.g. by using log labeling)
  3. Logs from the VM kernel cannot interfere with other logs
  4. Logs from a VM can still be viewed with kubectl logs during local development

Implementation ideas

TODO (various ideas, need to discuss)

Tasks

Other related tasks, Epics, and links

sharnoff added the t/Epic (Issue type: Epic) and c/autoscaling/neonvm (Component: autoscaling: NeonVM) labels on Oct 21, 2023
sharnoff mentioned this issue on Oct 21, 2023
lassizci (Contributor) commented
Omrigan self-assigned this on Dec 25, 2023
Omrigan (Contributor) commented Dec 25, 2023

Can we utilize vector, which we already have inside the VM, to push logs directly to loki, which we also already have?

lassizci (Contributor) commented
It's not a good practice from a security perspective to have credentials in the virtual machines. Also, if we think about reconfigurability, it's best to have as few expectations about observability built in as possible, so the pipeline can evolve independently without needing reconfiguration at the compute-instance level.

Omrigan (Contributor) commented Dec 25, 2023

> It's not a good practice from a security perspective to have credentials in the virtual machines.

But these would be write-only credentials. In that case, the only risk is a DoS from too many logs, which we can combat on the receiver end.

Another option is to have a separate instance of vector outside the VM in the pod, configured to pull data from the in-VM instance [1].

> Also, if we think about reconfigurability, it's best to have as few expectations about observability built in as possible, so the pipeline can evolve independently without needing reconfiguration at the compute-instance level.

What do you mean? Are you talking about updating credentials?

Or, in general, dependence on a particular observability agent? Such dependence, I believe, we cannot escape.

1: https://vector.dev/docs/reference/configuration/sources/vector/

lassizci (Contributor) commented Dec 25, 2023

> > It's not a good practice from a security perspective to have credentials in the virtual machines.
>
> But these would be write-only credentials. In that case, the only risk is a DoS from too many logs, which we can combat on the receiver end.

If we skip a collector we control, we cannot deal with a DoS at the receiving end. An escape from PostgreSQL would also potentially give control over labeling, etc.

We also do processing between collecting and sending the logs (relabeling, perhaps deriving metrics from logs, switching between plaintext and JSON, and so on…). Queueing of outgoing logs should also not happen inside the computes, but in a trusted environment.

Let's say our log storage is offline and the compute suspends. That would mean either losing the logs or keeping the compute online for retries.

> Another option is to have a separate instance of vector outside the VM in the pod, configured to pull data from the in-VM instance [1].

I think what makes the most sense is to write logs to a socket provided by the host; then we can treat the rest of the pipeline as an implementation detail (see the sketch after this comment).

> > Also, if we think about reconfigurability, it's best to have as few expectations about observability built in as possible, so the pipeline can evolve independently without needing reconfiguration at the compute-instance level.
>
> What do you mean? Are you talking about updating credentials?

Updating/rotating the credentials is one thing; others are building metrics from the logs, relabeling, adding labels, and swapping the log collector for something else.

> Or, in general, dependence on a particular observability agent? Such dependence, I believe, we cannot escape.

We can switch the observability agent rather easily when it runs outside of the virtual machines. That's currently possible, and I don't think it makes much sense to make it harder, nor to waste customers' CPU time and memory running such things.
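
For illustration, here is a minimal sketch of the guest side of the "write logs to a socket provided by the host" idea, assuming the host exposes a virtio-serial port at /dev/virtio-ports/logs and that a plain `<component> <line>` framing is acceptable; both are placeholders, not an agreed interface.

```go
// Guest-side sketch: forward one component's log stream to a host-provided
// channel. The device path and the "<component> <line>" framing are
// assumptions for illustration, not an agreed interface.
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// forward copies a component's log stream to the host-provided channel,
// prefixing every line with the component name so the host side can attach
// labels without parsing the payload itself.
func forward(component string, in io.Reader, out io.Writer) error {
	scanner := bufio.NewScanner(in)
	w := bufio.NewWriter(out)
	defer w.Flush()
	for scanner.Scan() {
		if _, err := fmt.Fprintf(w, "%s %s\n", component, scanner.Text()); err != nil {
			return err
		}
	}
	return scanner.Err()
}

func main() {
	// Hypothetical: the host exposes a virtio-serial port for logs.
	out, err := os.OpenFile("/dev/virtio-ports/logs", os.O_WRONLY, 0)
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Forward whatever is piped into this process (e.g. postgres output).
	if err := forward("postgres", os.Stdin, out); err != nil {
		panic(err)
	}
}
```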

sharnoff (Member, Author) commented
From discussing with @Omrigan earlier: one simplification we can make is to just get logs from the VM to stdout in neonvm-runner (the container running the VM). We already have log collection in k8s, so we can piggy-back on that, which is easier than trying to push the logs somewhere else.
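
A rough sketch of what that could look like on the neonvm-runner side, assuming the guest's tagged stream is bridged to a unix socket on the host (the socket path and line format below are hypothetical and must match whatever the guest writes). Each line is re-emitted as JSON on stdout, so the existing k8s log collection picks it up and filtering by the component field becomes trivial.

```go
// Host-side sketch for neonvm-runner: read the tagged stream from the VM and
// re-emit each line as JSON on stdout. The socket path and line format are
// hypothetical and must match whatever the guest side writes.
package main

import (
	"bufio"
	"encoding/json"
	"net"
	"os"
	"strings"
	"time"
)

type logLine struct {
	Timestamp time.Time `json:"ts"`
	Component string    `json:"component"` // e.g. "postgres", "dmesg", "vm-monitor"
	Message   string    `json:"msg"`
}

func main() {
	// Hypothetical unix socket that the VM's log channel is bridged to.
	conn, err := net.Dial("unix", "/run/neonvm/vm-logs.sock")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	enc := json.NewEncoder(os.Stdout)
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		// Assumed format: "<component> <raw log line>".
		component, msg, found := strings.Cut(scanner.Text(), " ")
		if !found {
			component, msg = "unknown", scanner.Text()
		}
		_ = enc.Encode(logLine{Timestamp: time.Now(), Component: component, Message: msg})
	}
}
```

With something like this, kubectl logs on the runner pod still shows everything during local development (DoD item 4), just with a component field to filter on.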

sharnoff (Member, Author) commented
Notes from discussion:

  • We should postpone this until we switch to systemd in the VMs, because of the extra tooling that systemd gives us
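
To make the "extra tooling" concrete: with systemd in the VM, journald already tags every entry with the unit that produced it, so a forwarder only needs to follow the journal and pass the unit name along as the component label. This is a sketch, not a design decision.

```go
// Sketch of what journald buys us: every entry already carries the unit that
// produced it (_SYSTEMD_UNIT), and kernel messages are marked with
// _TRANSPORT=kernel, so per-component tagging comes almost for free.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os/exec"
)

func main() {
	cmd := exec.Command("journalctl", "--follow", "--output=json")
	out, err := cmd.StdoutPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	scanner := bufio.NewScanner(out)
	for scanner.Scan() {
		var entry map[string]any
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			continue // skip lines we can't parse
		}
		unit, _ := entry["_SYSTEMD_UNIT"].(string)
		msg, _ := entry["MESSAGE"].(string)
		fmt.Printf("%s: %s\n", unit, msg)
	}
	_ = cmd.Wait()
}
```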

Omrigan (Contributor) commented Feb 13, 2024

We have an occurrence of non-postgres log spam (in this case, from the oom-killer), which won't be fixed by https://github.com/neondatabase/cloud/issues/8602

https://neondb.slack.com/archives/C03F5SM1N02/p1707489906661529

sharnoff (Member, Author) commented
An occurrence of log interleaving that could potentially be fixed by this, depending on how we implement it: https://neondb.slack.com/archives/C03TN5G758R/p1714057349130309

knz commented Oct 17, 2024

xref https://github.com/neondatabase/cloud/issues/18244
We have a customer ask to export the postgres logs to an external service, so they can inspect their own logs themselves (e.g. via Datadog).

We haven't fully specced that out yet, but the assumption so far is that we would reuse the OpenTelemetry collector we already deploy for metrics, and route the logs through it.

knz commented Oct 17, 2024

Regarding pushing logs to the console / k8s logs: the volume will be too large in some cases, e.g. if the user cares about pg_audit logs, and this will become a bottleneck. It also won't solve the labeling problem, which matters for the product: customers only want their postgres logs, not our own control logs. Better to export directly over the network (see the point below).

Regarding push/pull and credentials: one option is to have a service running inside the VM that accepts incoming connections, and delivers the logs from the VM through that. Would that solve the problem?
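
A toy sketch of that pull model, for discussion only: a small endpoint inside the VM that an external collector connects to, so no push credentials need to live in the guest. The port, URL path, and single-file log source are placeholders.

```go
// Toy sketch of a pull-based log endpoint inside the VM. The external
// collector connects in, so no push credentials live in the guest. The
// port and log path are placeholders; a real version would tail the file
// and authenticate the caller.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	http.HandleFunc("/logs/postgres", func(w http.ResponseWriter, r *http.Request) {
		f, err := os.Open("/var/log/postgres.log") // placeholder path
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer f.Close()
		w.Header().Set("Content-Type", "text/plain")
		_, _ = io.Copy(w, f) // stream current contents to the collector
	})
	if err := http.ListenAndServe(":9400", nil); err != nil {
		panic(err)
	}
}
```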
