
Synapse media repository workers seem unbounded in memory usage, causing high GC times #8653

Closed
michaelkaye opened this issue Oct 26, 2020 · 7 comments
Labels
A-Performance Performance, both client-facing and admin-facing z-bug (Deprecated Label)

Comments

@michaelkaye
Contributor

Currently we seem to be unbounded in memory usage on the media repository workers. This doesn't appear to be cache-related (I'm not even sure what we would cache). Having added 3 media_repo workers, we're now tripling this overhead.

For what they do, I don't believe they should be using this much memory (we've seen up to 24 GB on each of the 3 workers), and they have a very clean sawtooth memory-usage profile, punctuated by restarts:

[graph: media worker memory usage over time]

This memory is making its way into GC generation 2, so collections are starting to take up to 50s as we near the full memory size.

Could be a leak in some thumbnailing code or something?
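For reference, one way to confirm how long each gen 2 collection actually takes is the stdlib gc.callbacks hook. A minimal sketch (illustrative only, not Synapse code; it could be pasted into a manhole session):

    import gc
    import logging
    import time

    logger = logging.getLogger("gc_timer")
    _start_times = {}

    def _gc_timer(phase, info):
        # gc invokes this with phase "start"/"stop" and an info dict
        # containing the generation being collected.
        gen = info["generation"]
        if phase == "start":
            _start_times[gen] = time.monotonic()
        elif phase == "stop" and gen in _start_times:
            elapsed = time.monotonic() - _start_times.pop(gen)
            if gen == 2:
                logger.info(
                    "gen 2 GC took %.1fs, collected %d objects",
                    elapsed, info["collected"],
                )

    gc.callbacks.append(_gc_timer)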

@anoadragon453
Member

This looks to have been a problem on matrix.org for several months now, and is not a recent regression as far as I know.

@anoadragon453 anoadragon453 added z-bug (Deprecated Label) A-Performance Performance, both client-facing and admin-facing p1 labels Oct 26, 2020
@michaelkaye
Contributor Author

Is this something we're investigating now?

@anoadragon453
Member

Not actively. I've added it to the board so we don't forget about it.

@erikjohnston
Member

Next steps:

  1. Use pympler in a manhole to see if there are any obvious objects that are leaking (after having left it running for a while); a rough sketch is below this list.
  2. Check whether GC gen 2 collections are taking as long as expected. If not, try disabling Jaeger/opentracing for the media repos to see if that helps (we might be leaking in native land rather than Python land).
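A sketch of step 1, as it might be run from a manhole (assuming pympler is installed in the worker's environment; the exact calls here are illustrative):

    from pympler import muppy, summary, tracker

    # One-off snapshot: object counts and total size per type.
    summary.print_(summary.summarize(muppy.get_objects()))

    # Or keep a tracker around and diff it later to see which types keep growing.
    tr = tracker.SummaryTracker()
    # ... leave the worker running for a while, then:
    tr.print_diff()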

@turt2live
Member

There also seems to be a matching file descriptor leak here; maybe it's holding files in memory or something?

@erikjohnston
Member

erikjohnston commented Jan 27, 2021

We seem to be leaking sockets in CLOSE_WAIT:

$ lsof -n -p 54367 | grep -Eo "\(.*\)$" | sort | uniq -c
   1229 (CLOSE_WAIT)
    190 (ESTABLISHED)
      3 (LISTEN)

@richvdh
Member

richvdh commented Feb 24, 2021

This seems much improved now (presumably thanks to #9421):

[graph: media worker memory usage, now much reduced]

We still have a sawtooth on the "open FDs" graph; raising that as #9488.
