Memory-heavy workloads may be scaled too high #1030
Labels
c/autoscaling/autoscaler-agent
Component: autoscaling: autoscaler-agent
c/autoscaling/vm-monitor
Component: autoscaling: vm-monitor
t/bug
Issue Type: Bug
Comments
sharnoff added the c/autoscaling/autoscaler-agent, c/autoscaling/vm-monitor, and t/bug labels on Aug 9, 2024
sharnoff added a commit that referenced this issue on Aug 9, 2024
In short: In addition to scaling when there's a lot of memory used by postgres, we should also scale up to make sure that enough of the LFC is able to fit into the page cache alongside it. To answer "how much is enough of the LFC", we take the minimum of 5-minute LFC working set size (from window size) and the cached memory (from the 'Cached' field of /proc/meminfo, via vector metrics). Part of #1030. Must be deployed before the vm-monitor changes in order to make sure we don't have worse performance for workloads that are both memory-heavy and rely on LFC being in the VM's page cache.
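The "minimum of the LFC working set size and the cached memory" rule described above can be sketched as a small helper. This is an illustrative sketch, not the agent's actual code; the function and variable names here are hypothetical.

```go
package main

import "fmt"

// lfcGoal returns how much memory we aim to keep available for the LFC:
// the smaller of the estimated LFC working set size and the memory the
// kernel currently reports as cached. Both arguments are in bytes.
func lfcGoal(lfcWorkingSet, cachedMem uint64) uint64 {
	if lfcWorkingSet < cachedMem {
		return lfcWorkingSet
	}
	return cachedMem
}

func main() {
	const MiB = 1 << 20
	// Working set larger than what is actually cached: cap at cached memory.
	fmt.Println(lfcGoal(800*MiB, 512*MiB) / MiB) // 512
	// Working set smaller than cached memory: use the working set.
	fmt.Println(lfcGoal(256*MiB, 512*MiB) / MiB) // 256
}
```

Taking the minimum means we never request memory for more LFC than is actually resident in the page cache, which is what prevents the over-scaling described in the issue.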
sharnoff added a commit to neondatabase/neon that referenced this issue on Aug 9, 2024
In short: Currently we reserve 75% of memory for the LFC, meaning that we scale up to keep postgres using less than 25% of the compute's memory. This means that for certain memory-heavy workloads, we end up scaling much higher than is actually needed — in the worst case, up to 4x, although in practice it tends not to be quite so bad. Part of neondatabase/autoscaling#1030. Must be deployed after the autoscaler-agent changes in order to make sure we don't have worse performance for workloads that are both memory-heavy and rely on LFC being in the VM's page cache.
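The 4x worst case follows directly from the 25% budget: if postgres must stay under a fraction f of the compute's memory, the compute must be sized at usage/f, i.e. 1/0.25 = 4x. A hypothetical illustration (function name and shape are mine, not the agent's API):

```go
package main

import "fmt"

// requiredMemory returns the compute memory needed to keep postgres'
// usage below the given fraction of total memory.
func requiredMemory(postgresUsage uint64, postgresFraction float64) uint64 {
	return uint64(float64(postgresUsage) / postgresFraction)
}

func main() {
	const GiB = 1 << 30
	// 2 GiB of postgres usage forces an 8 GiB compute under a 25% budget.
	fmt.Println(requiredMemory(2*GiB, 0.25) / GiB) // 8
}
```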
This was referenced Aug 9, 2024
sharnoff added a commit that referenced this issue on Aug 17, 2024
In short: In addition to scaling when there's a lot of memory used by postgres, we should also scale up to make sure that enough of the LFC is able to fit into the page cache alongside it. To answer "how much is enough of the LFC", we take the minimum of the estimated working set size and the cached memory (from the 'Cached' field of /proc/meminfo, via vector metrics). Part of #1030. Must be deployed before the vm-monitor changes in order to make sure we don't have worse performance for workloads that are both memory-heavy and rely on LFC being in the VM's page cache.
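The 'Cached' field the commit message refers to comes from /proc/meminfo, where values are reported in kB. As a sketch of what extracting it involves (in practice the metric arrives via vector, not by parsing meminfo directly; this parser is illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// cachedBytes extracts the 'Cached' field from /proc/meminfo-formatted
// text and converts it from kB to bytes.
func cachedBytes(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "Cached:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("Cached field not found")
}

func main() {
	sample := "MemTotal: 16384000 kB\nCached: 524288 kB\nSwapTotal: 0 kB\n"
	b, err := cachedBytes(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(b) // 536870912
}
```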
sharnoff added a commit that referenced this issue on Sep 6, 2024
sharnoff added a commit that referenced this issue on Sep 10, 2024
sharnoff added a commit that referenced this issue on Sep 19, 2024
sharnoff added a commit that referenced this issue on Sep 19, 2024
sharnoff added a commit that referenced this issue on Sep 19, 2024
sharnoff added a commit to neondatabase/neon that referenced this issue on Oct 7, 2024
Now that neondatabase/neon#8668 has been merged, this will be fixed with the next compute release containing it.
erikgrinaker pushed a commit to neondatabase/neon that referenced this issue on Oct 8, 2024
Problem description / Motivation
Currently, the vm-monitor:
This works okay as a naive solution for most OLTP workloads, but it means that certain memory-heavy workloads end up scaled higher than they need to be (note: the memory usage measured excludes cache usage by the LFC, so the allocations are elsewhere — e.g. in a pgvector index build).
Meanwhile, the autoscaler-agent triggers upscaling when postgres' memory usage exceeds 75% of total memory — so in practice, scaling is almost always handled by the vm-monitor first.
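The autoscaler-agent's memory signal described above amounts to a simple threshold check. The function below is a hypothetical sketch of that check, not the agent's actual API:

```go
package main

import "fmt"

// shouldUpscale reports whether postgres' memory usage has crossed the
// given fraction of total memory. The 0.75 threshold matches the
// behavior described in this issue.
func shouldUpscale(postgresUsage, totalMem uint64, threshold float64) bool {
	return float64(postgresUsage) > threshold*float64(totalMem)
}

func main() {
	const MiB = 1 << 20
	fmt.Println(shouldUpscale(3500*MiB, 4096*MiB, 0.75)) // true
	fmt.Println(shouldUpscale(2048*MiB, 4096*MiB, 0.75)) // false
}
```

Because the vm-monitor reacts before usage ever reaches this 75% threshold, the agent-side signal rarely fires on its own — which is why the fix touches both components.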
This came up in this thread: https://neondb.slack.com/archives/C03TN5G758R/p1723127762991289
Feature idea(s) / DoD
We should be more careful about how we treat memory usage as a scaling signal, so that memory-heavy workloads are no longer scaled up beyond what's necessary. At the same time, we must make sure we don't harm performance for workloads that are memory-heavy and also rely on the LFC being in the OS page cache.
Implementation ideas
See https://www.notion.so/neondatabase/0f75b15d47ad479094861302a99114af
Tasks