Avoid copies when freezing and writing layers to disk #657

Closed
patins opened this issue Sep 24, 2021 · 6 comments
Labels
c/storage/pageserver Component: storage: pageserver

Comments

@patins
Contributor

patins commented Sep 24, 2021

We copy the segsizes and page_versions BTreeMaps during both of these processes.

Instead, maybe we can move these structs into an Arc, and share the Arc plus the valid LSN range with users of the data.
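A minimal sketch of what that could look like, with hypothetical type names standing in for the real pageserver structures:

```rust
use std::collections::BTreeMap;
use std::ops::Range;
use std::sync::Arc;

type Lsn = u64;
type PageVersion = Vec<u8>;

/// Instead of cloning the whole map when freezing, hand out a shared,
/// immutable view: the Arc plus the LSN range the reader may see.
struct FrozenView {
    page_versions: Arc<BTreeMap<Lsn, PageVersion>>,
    valid_lsns: Range<Lsn>,
}

impl FrozenView {
    fn get(&self, lsn: Lsn) -> Option<&PageVersion> {
        if self.valid_lsns.contains(&lsn) {
            self.page_versions.get(&lsn)
        } else {
            None
        }
    }
}
```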

@hlinnaka
Contributor

Yeah, copying the BTreeMaps is pretty expensive. Looking at the 'perf' profile, it incurs all kinds of overheads that add up:

  • BTreeMap::insert() shows up in the profile. I'm surprised there doesn't seem to be a way to construct the BTreeMap in a more efficient way; the input is already sorted so with some smarts building the BTreeMap should be a lot faster than it is.
  • There's malloc/free overhead
  • Because we're storing WALRecords and page images that use reference-counted Bytes buffers, copying requires bumping the reference count on every page version. And when the BTreeMap is dropped, all the reference counts need to be decreased again. The drop functions show up in the profile.

@hlinnaka
Contributor

I actually spent some time exploring this yesterday, and came up with this pretty simple optimization: https://github.com/zenithdb/zenith/tree/reduce-treemap-copying. It eliminates one of the expensive BTreeMap copies, when a new DeltaLayer is created. Instead of passing the page versions as a BTreeMap, the DeltaLayer::create function now takes an Iterator. It needs some polishing and comments, but it's a pretty small change with a big impact.
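A rough sketch of the shape of that change, with hypothetical signatures rather than the actual pageserver code:

```rust
use std::collections::BTreeMap;

type Lsn = u64;
type PageVersion = Vec<u8>;

// Before: the caller clones its in-memory BTreeMap just to hand it over.
fn create_from_map(page_versions: BTreeMap<Lsn, PageVersion>) {
    for (lsn, img) in page_versions {
        write_to_disk(lsn, &img);
    }
}

// After: the caller passes an iterator over the (already sorted) entries,
// so the layer can stream them to disk without building another map.
fn create_from_iter(page_versions: impl Iterator<Item = (Lsn, PageVersion)>) {
    for (lsn, img) in page_versions {
        write_to_disk(lsn, &img);
    }
}

fn write_to_disk(_lsn: Lsn, _img: &[u8]) {
    // stand-in for serializing into the on-disk delta layer format
}
```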

That doesn't eliminate all the BTreeMap copies and overhead, though; there's more that could be done.

@hlinnaka
Contributor

Some other things I've been exploring and thinking about, but don't have a patch for yet:

  • We should switch to using an "arena allocator", like @funbringer (?) suggested yesterday. That would eliminate much of the malloc/free overhead. And I believe it's required anyway for tracking the memory usage (Track memory usage #656).
  • In order to use an arena allocator, we need to move away from the reference-counted Bytes buffers in the page versions.
  • We also need to replace BTreeMap with something else that can use a custom allocator. Or rely on Rust nightly, which adds support for custom allocators. But I think we can write something specific to our use case from scratch pretty easily, and it will probably be faster than the generic BTreeMap anyway.
  • If we stop using reference-counted Bytes, we will need to copy the page versions when they're returned from the layer. Or, we tie the return value of e.g. get_page_reconstruct_data to the lifetime of the Layer reference, so that the returned reconstruct data is valid for as long as you have a reference to the Layer (see the sketch after this list).
  • That probably means that LayerMap needs to return some kind of a Guard object to the layer, rather than an Arc. Once we have that, we can probably use it to replace the current retry loop on write operations; if you get a write lease on a Layer from LayerMap, that lease can prevent the Layer from being dropped, so you don't need the check for whether it's still writeable and the retry loop.
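A sketch of what tying the returned data to the Layer's lifetime could look like; the names and layout are illustrative, not the actual API:

```rust
type Lsn = u64;

/// Hypothetical layer that owns its page images in one flat buffer.
struct DeltaLayer {
    buf: Vec<u8>,
    /// (lsn, offset, len) entries into `buf`, sorted by LSN
    index: Vec<(Lsn, usize, usize)>,
}

impl DeltaLayer {
    /// The returned slice borrows from `self`, so it stays valid for exactly
    /// as long as the caller holds a reference to the layer: no refcount
    /// bumps on read, and no copy of the page image.
    fn get_page_reconstruct_data(&self, lsn: Lsn) -> Option<&[u8]> {
        // Find the last index entry with an LSN <= the requested one.
        let pos = self.index.partition_point(|&(l, _, _)| l <= lsn);
        let (_, off, len) = *self.index.get(pos.checked_sub(1)?)?;
        Some(&self.buf[off..off + len])
    }
}
```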

And:

  • Kind of opposite to the above line of thinking, but we could avoid one copy of the BTreeMaps if we added a retry loop for the read paths as well. I'm thinking that freeze() would mark the layer as a tombstone that's no longer valid. If you try to read from a tombstone layer, you need to retry (sketched below).
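A sketch of the tombstone-and-retry idea, again with hypothetical names:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

type Lsn = u64;

struct InMemoryLayer {
    frozen: AtomicBool,
    // ... page versions ...
}

enum ReadResult<T> {
    Ok(T),
    /// The layer was frozen under us; look it up in LayerMap again and retry.
    Retry,
}

impl InMemoryLayer {
    fn freeze(&self) {
        // Turn the layer into a tombstone: after this, readers must retry
        // against whatever layer replaced it, instead of us copying the data.
        self.frozen.store(true, Ordering::Release);
    }

    fn get_page_version(&self, _lsn: Lsn) -> ReadResult<Vec<u8>> {
        if self.frozen.load(Ordering::Acquire) {
            return ReadResult::Retry;
        }
        ReadResult::Ok(Vec::new()) // placeholder for the actual lookup
    }
}
```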

@LizardWizzard
Contributor

LizardWizzard commented Sep 24, 2021

Just a few notes:

We should switch to using an "arena allocator".
And I believe it's required anyway for tracking the memory usage

If I understand it correctly, tracking is not directly tied to an arena allocator, so it can be any allocator with internal tracking. And it is simple to build our own that wraps the default allocator and records allocations (or to take something from crates.io).
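For example, a minimal tracking allocator that wraps the system allocator and keeps an atomic byte counter (a sketch, not tied to any particular crate):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let p = System.alloc(layout);
        if !p.is_null() {
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        }
        p
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: TrackingAllocator = TrackingAllocator;

/// Bytes currently allocated through the global allocator.
fn current_usage() -> usize {
    ALLOCATED.load(Ordering::Relaxed)
}
```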

Or rely on rust nightly, which adds support for custom allocator.

Even on nightly, a custom allocator for BTreeMap is not available yet; see rust-lang/wg-allocators#7

Also see this discord thread: https://discord.com/channels/869525774699462656/890916952489484298/890916955488419900

@LizardWizzard
Contributor

Also, for fixed-size data like pages we can leverage slab allocation (for example https://github.com/tokio-rs/slab).
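For instance, assuming 8 KB pages, the slab crate could be used roughly like this:

```rust
use slab::Slab; // slab = "0.4"

const PAGE_SIZE: usize = 8192;

fn main() {
    // All entries have the same size, so the slab reuses freed slots
    // instead of going back to the general-purpose allocator.
    let mut pages: Slab<Box<[u8; PAGE_SIZE]>> = Slab::new();

    let key = pages.insert(Box::new([0u8; PAGE_SIZE]));
    assert_eq!(pages[key].len(), PAGE_SIZE);

    // Removing an entry frees its slot for reuse by the next insert.
    pages.remove(key);
}
```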

@hlinnaka added the c/storage/pageserver (Component: storage: pageserver) label on Oct 7, 2021
@hlinnaka
Contributor

This hasn't been a significant performance issue for a while now. All the ideas for improvements listed here still probably make sense, but the first thing would be to do some profiling to see where exactly the CPU time is spent nowadays, and decide what to do based on that. Closing.
