This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Proposal for easier cache tuning. #9903

Closed
erikjohnston opened this issue Apr 28, 2021 · 5 comments
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@erikjohnston
Member

We want caches to be more easily tunable (without requiring expert knowledge), and ideally for the defaults to provide a decent experience out of the box. The flip side is that we want to ensure we don't use unnecessary amounts of memory, as a) we want to keep memory usage low in general, and b) more memory increases GC resource usage.

What I'm currently thinking of is being able to set eviction thresholds based on the total memory usage of the caches and/or each entry's last access time. That way we can have levels such as:

  1. Evict anything older than 30 mins
  2. Evict anything older than 15 mins while memory usage is above 512MB
  3. Evict anything older than 5 mins while memory usage is above 1GB

The idea is that we stay under the memory limits without evicting recently used entries from the caches. It'll probably take some experimentation to come up with the right values.
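
As a minimal sketch of how those tiers might be expressed (the tier table and helper name below are illustrative, not an actual Synapse API):

```python
# Illustrative tier table mapping memory pressure to a maximum entry age.
# The values mirror the levels above; none of this is real Synapse code.
EVICTION_TIERS = [
    # (memory usage threshold in bytes, max entry age in seconds)
    (1024 * 1024 * 1024, 5 * 60),   # above 1GB: evict anything older than 5 mins
    (512 * 1024 * 1024, 15 * 60),   # above 512MB: evict anything older than 15 mins
    (0, 30 * 60),                   # otherwise: evict anything older than 30 mins
]


def max_entry_age_seconds(memory_usage_bytes: int) -> int:
    """Pick the eviction age limit for the current level of memory pressure."""
    for threshold, max_age in EVICTION_TIERS:
        if memory_usage_bytes > threshold:
            return max_age
    # Reached only when usage is exactly zero; fall back to the laxest tier.
    return EVICTION_TIERS[-1][1]
```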

This could be implemented efficiently by having the LruCache also insert nodes into a global linked list, with last access times tracked per node. To use memory limits we'd also need to either a) enable the memory tracking in #9881, or b) use jemalloc to get an accurate measure of how much allocated memory is actually in use.
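
A rough sketch of that global list (the names and structure here are assumptions for illustration; Synapse's actual LruCache is more involved):

```python
import time


class Node:
    """A cache entry that is also linked into the global access-ordered list."""

    __slots__ = ("key", "value", "last_access", "prev", "next")

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.last_access = time.monotonic()
        self.prev = None
        self.next = None


class GlobalAccessList:
    """Doubly linked list of every cache's nodes, least recently accessed first."""

    def __init__(self):
        self.head = None  # least recently accessed
        self.tail = None  # most recently accessed

    def append(self, node):
        node.prev, node.next = self.tail, None
        if self.tail is not None:
            self.tail.next = node
        self.tail = node
        if self.head is None:
            self.head = node

    def remove(self, node):
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next is not None:
            node.next.prev = node.prev
        else:
            self.tail = node.prev
        node.prev = node.next = None

    def touch(self, node):
        """Record an access: stamp the time and move the node to the tail.

        The node must already be in the list.
        """
        node.last_access = time.monotonic()
        self.remove(node)
        self.append(node)
```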

We may also want to take into account how often a node has been accessed. E.g. we may wish to expire nodes that have only ever been accessed once more aggressively than those that get accessed periodically (the LRU nature of the cache already takes this into account somewhat).

@erikjohnston erikjohnston added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Apr 28, 2021
@erikjohnston
Member Author

> We may also want to take into account how often a node has been accessed. E.g. we may wish to expire nodes that have only ever been accessed once more aggressively than those that get accessed periodically (the LRU nature of the cache already takes this into account somewhat).

We could also use a counting Bloom filter to count how many times a key was accessed in the last N minutes. This would allow policies like "store in the cache for 5 minutes, unless it's been accessed more than once in the last hour": we could quickly drop items that only ever get accessed once, but retain nodes that have been accessed multiple times. (Though this likely means we end up adding the value to the cache, dropping it, then re-adding it, at which point it stays there for a while.)

The reason to do this is to strike a good balance between quickly evicting entries that never get used again, and retaining items that are queried periodically at an interval greater than the expiry.
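
A minimal sketch of such a counter (the sizes and hashing scheme are arbitrary choices for illustration; approximating "the last N minutes" would additionally require resetting or decaying the filter on a timer, e.g. by keeping two filters and periodically swapping them):

```python
import hashlib


class CountingBloomFilter:
    """Approximate per-key access counter: may overcount, never undercounts."""

    def __init__(self, size: int = 1 << 16, num_hashes: int = 4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, key: str):
        # Derive `num_hashes` independent positions by salting the hash.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                key.encode("utf-8"), salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.size

    def record_access(self, key: str) -> None:
        for idx in self._indexes(key):
            self.counters[idx] += 1

    def estimated_count(self, key: str) -> int:
        # The smallest of the key's counters is an upper bound on its true count.
        return min(self.counters[idx] for idx in self._indexes(key))
```

An eviction policy could then treat `estimated_count(key) > 1` as the signal to keep an entry for the longer expiry.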

@richvdh
Member

richvdh commented Apr 29, 2021

I think that some sort of cross-cache memory management is a good first step here. We can add fancier things like "how many times was it used in the last hour" later. (Likewise: I always think it would be good to take into account the amount of recalculation we save by having something in the cache, which depends on what it is that is cached. But again, I'd add that later.)

@erikjohnston
Member Author

> This could be implemented efficiently by having the LruCache also insert nodes into a global linked list, with last access times tracked per node. To use memory limits we'd also need to either a) enable the memory tracking in #9881, or b) use jemalloc to get an accurate measure of how much allocated memory is actually in use.

Both of these are now done, so this issue is unblocked.

My suggestion for how to implement this is to have three knobs that can be twiddled (but with better names):

  1. A high_memory_limit config: when memory usage goes above this, we start evicting from the caches.
  2. A low_target_memory config: once eviction starts, we continue until memory usage drops beneath this level.
  3. A min_expiry_seconds config: we never evict cache entries newer than this.

Then it should just be a case of updating the looping call at:

```python
async def _expire_old_entries(clock: Clock, expiry_seconds: int) -> None:
```

8dc1836 refactors the current jemalloc stuff to expose the necessary APIs to get the currently allocated memory (by calling `get_stat("allocated")`).

Note: we should only do this when using jemalloc.
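
A sketch of what the updated loop might look like, reusing the `GlobalAccessList` shape sketched in the opening comment. The knob names are the placeholders above, and `jemalloc_allocated_bytes` is a hypothetical stand-in for the `get_stat("allocated")` call:

```python
import time

HIGH_MEMORY_LIMIT = 1024 * 1024 * 1024  # start evicting above this
LOW_TARGET_MEMORY = 512 * 1024 * 1024   # keep evicting until below this
MIN_EXPIRY_SECONDS = 5 * 60             # never evict entries newer than this


def jemalloc_allocated_bytes():
    """Stand-in for querying jemalloc's "allocated" stat; None if unavailable."""
    return None


def expire_old_entries(global_list) -> None:
    """One pass of the periodic eviction job."""
    allocated = jemalloc_allocated_bytes()
    if allocated is None or allocated <= HIGH_MEMORY_LIMIT:
        # Only act when jemalloc stats are available and we're over the limit.
        return

    now = time.monotonic()
    node = global_list.head  # least recently accessed first
    while node is not None:
        if now - node.last_access < MIN_EXPIRY_SECONDS:
            break  # everything further along the list is newer still
        next_node = node.next
        global_list.remove(node)  # the owning cache would also drop the entry
        node = next_node
        allocated = jemalloc_allocated_bytes()
        if allocated is not None and allocated <= LOW_TARGET_MEMORY:
            break  # back under the target; stop evicting
```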

@H-Shay
Contributor

H-Shay commented Jun 6, 2022

With #12701 having landed, can we consider this issue complete?

@erikjohnston
Member Author

Let's go with yes!
