Avoid copies when freezing and writing layers to disk #657

Closed
patins opened this issue Sep 24, 2021 · 6 comments
Labels
c/storage/pageserver Component: storage: pageserver

Comments

@patins
Contributor

patins commented Sep 24, 2021

We copy the segsizes and page_versions BTreeMaps during both of these processes.

Instead, maybe we can move these structs into an Arc, and share the Arc plus the valid LSN range with users of the data.
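A minimal sketch of what that could look like, with hypothetical type names standing in for the real pageserver structures:

```rust
use std::collections::BTreeMap;
use std::ops::Range;
use std::sync::Arc;

type Lsn = u64;
type PageVersion = Vec<u8>;

/// Instead of cloning the whole map when freezing, hand out a shared,
/// immutable view: the Arc plus the LSN range the reader may see.
struct FrozenView {
    page_versions: Arc<BTreeMap<Lsn, PageVersion>>,
    valid_lsns: Range<Lsn>,
}

impl FrozenView {
    fn get(&self, lsn: Lsn) -> Option<&PageVersion> {
        if self.valid_lsns.contains(&lsn) {
            self.page_versions.get(&lsn)
        } else {
            None
        }
    }
}
```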

@hlinnaka
Contributor

Yeah, copying the BTreeMaps is pretty expensive. Looking at the 'perf' profile, it incurs all kinds of overheads that add up:

  • BTreeMap::insert() shows up in the profile. I'm surprised there doesn't seem to be a way to construct the BTreeMap in a more efficient way; the input is already sorted so with some smarts building the BTreeMap should be a lot faster than it is.
  • There's malloc/free overhead
  • Because we're storing WALRecords and page images that use reference-counted Bytes buffers, copying requires bumping the reference count on every page version. And when the BTreeMap is dropped, all the reference counts need to be decreased again. The drop functions show up in the profile.

@hlinnaka
Contributor

I actually spent some time exploring this yesterday, and came up with this pretty simple optimization: https://github.com/zenithdb/zenith/tree/reduce-treemap-copying. It eliminates one of the expensive BTreeMap copies, when a new DeltaLayer is created. Instead of passing the page versions as a BTreeMap, the DeltaLayer::create function now takes an Iterator. It needs some polishing and comments, but it's a pretty small change with a big impact.
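A rough sketch of the shape of that change, with hypothetical signatures rather than the actual pageserver code:

```rust
use std::collections::BTreeMap;

type Lsn = u64;
type PageVersion = Vec<u8>;

// Before: the caller clones its in-memory BTreeMap just to hand it over.
fn create_from_map(page_versions: BTreeMap<Lsn, PageVersion>) {
    for (lsn, img) in page_versions {
        write_to_disk(lsn, &img);
    }
}

// After: the caller passes an iterator over the (already sorted) entries,
// so the layer can stream them to disk without building another map.
fn create_from_iter(page_versions: impl Iterator<Item = (Lsn, PageVersion)>) {
    for (lsn, img) in page_versions {
        write_to_disk(lsn, &img);
    }
}

fn write_to_disk(_lsn: Lsn, _img: &[u8]) {
    // stand-in for serializing into the on-disk delta layer format
}
```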

That doesn't eliminate all the BTreeMap copies and overhead, though; there's more that could be done.

@hlinnaka
Contributor

Some other things I've been exploring and thinking about, but don't have a patch for yet:

  • We should switch to using an "arena allocator", like @funbringer (?) suggested yesterday. That would eliminate much of the malloc/free overhead. And I believe it's required anyway for tracking the memory usage (Track memory usage #656).
  • In order to use an arena allocator, we need to move away from the reference-counted Bytes buffers in the page versions.
  • We also need to replace BTreeMap with something else that can use a custom allocator. Or rely on Rust nightly, which adds support for custom allocators. But I think we can write something specific to our use case from scratch pretty easily, and it will probably be faster than the generic BTreeMap anyway.
  • If we stop using reference-counted Bytes, we will need to copy the page versions when they're returned from the layer. Or, we tie the return value of e.g. get_page_reconstruct_data to the lifetime of the Layer reference, so that the returned reconstruct data is valid for as long as you have a reference to the Layer (see the sketch after this list).
  • That probably means that LayerMap needs to return some kind of a Guard object to the layer, rather than an Arc. Once we have that, we can probably use it to replace the current retry loop on write operations; if you get a write lease on a Layer from LayerMap, that lease can prevent the Layer from being dropped, so you don't need the check for whether it's still writeable and the retry loop.
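A sketch of what tying the returned data to the Layer's lifetime could look like; the names and layout are illustrative, not the actual API:

```rust
type Lsn = u64;

/// Hypothetical layer that owns its page images in one flat buffer.
struct DeltaLayer {
    buf: Vec<u8>,
    /// (lsn, offset, len) entries into `buf`, sorted by LSN
    index: Vec<(Lsn, usize, usize)>,
}

impl DeltaLayer {
    /// The returned slice borrows from `self`, so it stays valid for exactly
    /// as long as the caller holds a reference to the layer: no refcount
    /// bumps on read, and no copy of the page image.
    fn get_page_reconstruct_data(&self, lsn: Lsn) -> Option<&[u8]> {
        // Find the last index entry with an LSN <= the requested one.
        let pos = self.index.partition_point(|&(l, _, _)| l <= lsn);
        let (_, off, len) = *self.index.get(pos.checked_sub(1)?)?;
        Some(&self.buf[off..off + len])
    }
}
```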

And:

  • Kind of opposite to the above line of thinking, but we could avoid one copy of the BTreeMaps if we added a retry loop for the read paths as well. I'm thinking that freeze() would mark the layer as a tombstone that's no longer valid. If you try to read from a tombstone layer, you need to retry (sketched below).
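A sketch of the tombstone-and-retry idea, again with hypothetical names:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

type Lsn = u64;

struct InMemoryLayer {
    frozen: AtomicBool,
    // ... page versions ...
}

enum ReadResult<T> {
    Ok(T),
    /// The layer was frozen under us; look it up in LayerMap again and retry.
    Retry,
}

impl InMemoryLayer {
    fn freeze(&self) {
        // Turn the layer into a tombstone: after this, readers must retry
        // against whatever layer replaced it, instead of us copying the data.
        self.frozen.store(true, Ordering::Release);
    }

    fn get_page_version(&self, _lsn: Lsn) -> ReadResult<Vec<u8>> {
        if self.frozen.load(Ordering::Acquire) {
            return ReadResult::Retry;
        }
        ReadResult::Ok(Vec::new()) // placeholder for the actual lookup
    }
}
```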

@LizardWizzard
Contributor

LizardWizzard commented Sep 24, 2021

Just a few notes:

We should switch to using an "arena allocator".
And I believe it's required anyway for tracking the memory usage

If I understand it correctly, tracking is not directly tied to an arena allocator, so it can be any allocator with internal tracking. And it is simple to build our own that wraps the default allocator and records allocations (or to take something from crates.io).
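For example, a minimal tracking allocator that wraps the system allocator and keeps an atomic byte counter (a sketch, not tied to any particular crate):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let p = System.alloc(layout);
        if !p.is_null() {
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        }
        p
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: TrackingAllocator = TrackingAllocator;

/// Bytes currently allocated through the global allocator.
fn current_usage() -> usize {
    ALLOCATED.load(Ordering::Relaxed)
}
```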

Or rely on rust nightly, which adds support for custom allocator.

Even on nightly, a custom allocator for BTreeMap is not available yet; see rust-lang/wg-allocators#7

Also see this discord thread: https://discord.com/channels/869525774699462656/890916952489484298/890916955488419900

@LizardWizzard
Contributor

Also, for fixed-size data like pages we can leverage slab allocation (for example https://github.com/tokio-rs/slab).
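For instance, assuming 8 KB pages, the slab crate could be used roughly like this:

```rust
use slab::Slab; // slab = "0.4"

const PAGE_SIZE: usize = 8192;

fn main() {
    // All entries have the same size, so the slab reuses freed slots
    // instead of going back to the general-purpose allocator.
    let mut pages: Slab<Box<[u8; PAGE_SIZE]>> = Slab::new();

    let key = pages.insert(Box::new([0u8; PAGE_SIZE]));
    assert_eq!(pages[key].len(), PAGE_SIZE);

    // Removing an entry frees its slot for reuse by the next insert.
    pages.remove(key);
}
```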

@hlinnaka added the c/storage/pageserver (Component: storage: pageserver) label on Oct 7, 2021
@hlinnaka
Contributor

This hasn't been a significant performance issue for a while now. All the ideas for improvements listed here still probably make sense, but the first thing would be to do some profiling to see where exactly the CPU time is spent nowadays, and decide what to do based on that. Closing.
