
BlobDB Caching #10156

Closed
13 of 14 tasks
gangliao opened this issue Jun 13, 2022 · 8 comments
gangliao commented Jun 13, 2022

I want to use this GitHub issue to track each task for BlobDB caching, since we plan to split the work into a separate PR per task to make code review more straightforward and explicit.

Integrate caching into the blob read logic

In contrast with block-based tables, which can utilize RocksDB's block cache (see https://github.com/facebook/rocksdb/wiki/Block-Cache), there is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
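A minimal sketch of what configuring a blob cache looks like via the public RocksDB API (the cache size and path below are illustrative):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.enable_blob_files = true;
  options.min_blob_size = 1024;  // store values >= 1 KB in blob files

  // A dedicated LRU cache for blob values; it could also share a backing
  // cache with the block cache (see the shared-cache example below).
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 64 << 20;  // 64 MB
  options.blob_cache = rocksdb::NewLRUCache(cache_opts);

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blob_cache_demo", &db);
  if (s.ok()) {
    delete db;
  }
  return 0;
}
```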

Clean up Version::MultiGetBlob() and move 'blob'-related code snippets into BlobSource. Also, add a new MultiGetBlob() API in BlobSource. More context in #10225.

- Version::MultiGetBlob(...) // multiple files multiple blobs
  -> BlobSource::MultiGetBlob()  // multiple files multiple blobs
    -> BlobSource::MultiGetBlobFromOneFile() // one file, multiple blobs

By definition, BlobSource also has information about multiple blob files, thus we can push the logic into this layer.

Add the blob cache to the stress tests and the benchmarking tool

In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool db_stress and our continuously running crash test script db_crashtest.py, as well as our synthetic benchmarking tool db_bench and the BlobDB performance testing script run_blob_bench.sh. As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.

Add blob cache tickers, perf context statistics, and DB properties

In order to be able to monitor the performance of the new blob cache, we made the following changes:

- Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics)
- Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context)
- Implement new DB properties that expose the capacity and current usage of the blob cache
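For instance, once the new DB properties exist, the blob cache's capacity and usage can be read back like this (a sketch; the property constants were added as part of this task):

```cpp
#include <cstdio>
#include <rocksdb/db.h>
#include <string>

// Print blob cache metrics via the DB property interface.
void PrintBlobCacheStats(rocksdb::DB* db) {
  std::string capacity;
  std::string usage;
  db->GetProperty(rocksdb::DB::Properties::kBlobCacheCapacity, &capacity);
  db->GetProperty(rocksdb::DB::Properties::kBlobCacheUsage, &usage);
  std::printf("blob cache capacity: %s, usage: %s\n", capacity.c_str(),
              usage.c_str());
}
```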

Charge blob cache usage against the global memory limit

To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
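One way this could surface in the options, using the existing `cache_usage_options` mechanism already applied to other memory users (treat the exact wiring as a sketch):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch: charge the blob cache's usage against the block cache's capacity
// when the two caches are backed by different Cache objects.
rocksdb::Options MakeChargedOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  options.blob_cache = rocksdb::NewLRUCache(128 << 20);  // separate blob cache

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(1 << 30);  // global limit
  // Reserve the blob cache's usage inside the block cache via dummy entries.
  table_options.cache_usage_options.options_overrides.insert(
      {rocksdb::CacheEntryRole::kBlobCache,
       {/*charged=*/rocksdb::CacheEntryRoleOptions::Decision::kEnabled}});
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```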

Eliminate the copying of blobs when serving reads from the cache

The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target PinnableSlice. (Note: this relies on the Cleanable interface, which is implemented by PinnableSlice.) This has the potential to save a lot of CPU, especially with large blob values.
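The technique can be illustrated with a small sketch (the real logic lives inside `BlobSource`; the helper name and the cached value type below are hypothetical):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/slice.h>
#include <string>

// On a blob cache hit, pin the cached value into the output PinnableSlice
// instead of copying it; the registered cleanup releases the cache handle
// when the slice is reset or destroyed.
void PinCachedBlob(rocksdb::Cache* cache, rocksdb::Cache::Handle* handle,
                   rocksdb::PinnableSlice* value) {
  // Hypothetical: assume the cache stores blob values as std::string.
  const auto* blob = static_cast<const std::string*>(cache->Value(handle));
  value->PinSlice(rocksdb::Slice(*blob),
                  [](void* arg1, void* arg2) {
                    auto* c = static_cast<rocksdb::Cache*>(arg1);
                    c->Release(static_cast<rocksdb::Cache::Handle*>(arg2));
                  },
                  cache, handle);
}
```

This works because `PinnableSlice` inherits from `Cleanable`, so the cache handle stays pinned exactly as long as the application holds the slice.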

Support prepopulating/warming the blob cache

Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
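In terms of configuration, warming the cache at flush time could look like this (option name per the PR below; a sketch):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>

rocksdb::Options MakeWarmedBlobOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  options.blob_cache = rocksdb::NewLRUCache(256 << 20);
  // Insert freshly written blobs into the blob cache as part of flush, so
  // that reads soon after writes hit the cache instead of remote storage.
  options.prepopulate_blob_cache = rocksdb::PrepopulateBlobCache::kFlushOnly;
  return options;
}
```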

Add a blob-specific cache priority

RocksDB's Cache abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
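With a shared backing cache, the priority-pool split could be expressed roughly like this (the `low_pri_pool_ratio` knob and the implicit bottom pool are per the PRs below; a sketch):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options MakeSharedCacheOptions() {
  // One backing cache with three priority pools: high for index/filter
  // blocks, low for data blocks, and the remainder (bottom) for blobs.
  rocksdb::LRUCacheOptions co;
  co.capacity = 1 << 30;
  co.high_pri_pool_ratio = 0.2;  // metablocks
  co.low_pri_pool_ratio = 0.6;   // data blocks; the remaining 20% is the
                                 // bottom-priority pool used for blobs
  auto shared_cache = rocksdb::NewLRUCache(co);

  rocksdb::Options options;
  options.enable_blob_files = true;
  options.blob_cache = shared_cache;

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = shared_cache;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```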

Support using secondary cache with the blob cache

RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used.
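Wiring a secondary cache under the blob cache could look like this, using the compressed secondary cache as an example (sizes are illustrative):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>

rocksdb::Options MakeTieredBlobCacheOptions() {
  // A compressed secondary cache behind the primary (LRU) blob cache;
  // entries evicted from the primary tier can spill over into it.
  rocksdb::CompressedSecondaryCacheOptions secondary_opts;
  secondary_opts.capacity = 1 << 30;

  rocksdb::LRUCacheOptions primary_opts;
  primary_opts.capacity = 256 << 20;
  primary_opts.secondary_cache =
      rocksdb::NewCompressedSecondaryCache(secondary_opts);

  rocksdb::Options options;
  options.enable_blob_files = true;
  options.blob_cache = rocksdb::NewLRUCache(primary_opts);
  return options;
}
```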


Support an improved/global limit on BlobDB's space amp

BlobDB currently supports limiting space amplification via the configuration option blob_garbage_collection_force_threshold. It works by computing the ratio of garbage (i.e. garbage bytes divided by total bytes) over the oldest batch of blob files, and if the ratio exceeds the specified threshold, it triggers a special type of compaction targeting the SST files that point to the blob files in question. (There is a coarse mapping between SSTs and blob files, which we track in the MANIFEST.)

This existing option can be difficult to use or tune. There are (at least) two challenges:

1. The occupancy of blob files is not uniform: older blob files tend to have more garbage, so if a service owner has a specific space amp goal, it is far from obvious what value they should set for blob_garbage_collection_force_threshold.
2. BlobDB keeps track of the exact amount of garbage in blob files, which enables us to compute the blob files' "space amp" precisely. Even though it's an exact value, there is a disconnect between this metric and people's expectations regarding space amp. The problem is that while people tend to think of LSM tree space amp as the ratio between the total size of the DB and the total size of the live/current KVs, for the purposes of blob space amp, a blob is only considered garbage once the corresponding blob reference has already been compacted out from the LSM tree. (One could say that the LSM tree space amp notion described above is "logical", while the blob one is "physical".)

To make the users' lives easier and solve (1), we would want to add a new configuration option (working title: blob_garbage_collection_space_amp_limit) that would enable customers to directly set a space amp target (as opposed to a per-blob-file-batch garbage threshold). To bridge the gap between the above notion of LSM tree space amp and the blob space amp (2), we would want this limit to apply to the entire data structure/database (the LSM tree plus the blob files). Note that this will necessarily be an estimate, since we don't know exactly how much space the obsolete KVs take up in the LSM tree. One simple idea would be to take the reciprocal of the LSM tree space amp estimated using the method of VersionStorageInfo::EstimateLiveDataSize, and scale the number of live blob bytes using the same factor.

Example: let's say the LSM tree space amp is 1.5, which means that the live KVs take up two thirds of the LSM. Then, we can use the same 2/3 factor to multiply the value of (total blob bytes - garbage blob bytes) to get an estimate of the live blob bytes from the user's perspective.

Note: if the above limit is breached, we would still want to do the same thing as in the case of blob_garbage_collection_force_threshold, i.e. force-compact the SSTs pointing to the oldest blob files (potentially repeatedly, until the limit is satisfied).
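The arithmetic in the example above can be captured in a small helper (entirely hypothetical naming; it just mirrors the proposed estimate):

```cpp
#include <cassert>

// Hypothetical sketch of the estimate described above: scale live blob bytes
// by the reciprocal of the LSM tree's estimated space amp, then compute the
// overall (LSM + blob) space amp from the user's "logical" perspective.
double EstimateTotalSpaceAmp(double lsm_total_bytes, double lsm_live_bytes,
                             double blob_total_bytes,
                             double blob_garbage_bytes) {
  const double lsm_space_amp = lsm_total_bytes / lsm_live_bytes;  // e.g. 1.5
  const double live_factor = 1.0 / lsm_space_amp;                 // e.g. 2/3
  // "Physical" live blob bytes, scaled down to the "logical" view:
  const double live_blob_bytes =
      (blob_total_bytes - blob_garbage_bytes) * live_factor;
  return (lsm_total_bytes + blob_total_bytes) /
         (lsm_live_bytes + live_blob_bytes);
}
```

With the numbers from the example (LSM space amp 1.5, no blob garbage), the overall estimate works out to 1.5 as well, since the blob bytes are scaled by the same 2/3 factor.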

facebook-github-bot pushed a commit that referenced this issue Jun 14, 2022
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
This PR is a part of #10156

Pull Request resolved: #10155

Reviewed By: ltamasi

Differential Revision: D37150819

Pulled By: gangliao

fbshipit-source-id: b807c7916ea5d411588128f8e22a49f171388fe2
facebook-github-bot pushed a commit that referenced this issue Jun 17, 2022
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`.  `MultiGetBlob()` will be included in the next PR.

This PR is a part of #10156

Pull Request resolved: #10178

Reviewed By: ltamasi

Differential Revision: D37250507

Pulled By: gangliao

fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b
gangliao added a commit to gangliao/rocksdb that referenced this issue Jun 20, 2022
Summary:

To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.

This PR is a part of facebook#10156
facebook-github-bot pushed a commit that referenced this issue Jun 21, 2022
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetches blobs from the multi-tier cache and storage at minimal cost.

Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!

This PR is a part of #10156

Pull Request resolved: #10198

Reviewed By: ltamasi

Differential Revision: D37294735

Pulled By: gangliao

fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e

gangliao commented Jun 21, 2022

Potential Bug

facebook-github-bot pushed a commit that referenced this issue Jun 22, 2022

Summary:
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.

This PR is a part of #10156

Pull Request resolved: #10202

Reviewed By: ltamasi

Differential Revision: D37325739

Pulled By: gangliao

fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
facebook-github-bot pushed a commit that referenced this issue Jun 23, 2022
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we added the new API MultiGetBlob() for BlobSource.

This PR is a part of #10156

Pull Request resolved: #10225

Test Plan:
Add test cases for MultiGetBlob

Reviewed By: ltamasi

Differential Revision: D37358364

Pulled By: gangliao

fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6
facebook-github-bot pushed a commit that referenced this issue Jun 28, 2022

Summary:
In order to be able to monitor the performance of the new blob cache, we made the follow changes:
- Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics)
- Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context)
- Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache.

This PR is a part of #10156

Pull Request resolved: #10203

Reviewed By: ltamasi

Differential Revision: D37478658

Pulled By: gangliao

fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40
gangliao added a commit to gangliao/rocksdb that referenced this issue Jun 28, 2022
Summary:

- Enabled blob caching for MultiGetBlob in RocksDB
- Refactored MultiGetBlob logic and interface in RocksDB
- Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource

This task is a part of facebook#10156
facebook-github-bot pushed a commit that referenced this issue Jun 30, 2022
Summary:
- [x] Enabled blob caching for MultiGetBlob in RocksDB
- [x] Refactored MultiGetBlob logic and interface in RocksDB
- [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource
- [x] Added end-to-end test cases in db_blob_basic_test and unit tests in blob_source_test

This task is a part of #10156

Pull Request resolved: #10272

Reviewed By: ltamasi

Differential Revision: D37558112

Pulled By: gangliao

fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db
facebook-github-bot pushed a commit that referenced this issue Jul 7, 2022

Summary:
The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.)

This has the potential to save a lot of CPU, especially with large blob values.

This task is a part of #10156

Pull Request resolved: #10297

Reviewed By: riversand963

Differential Revision: D37640311

Pulled By: gangliao

fbshipit-source-id: 92de0e35cc703d06c87c5c1861cc2899ec52234a
facebook-github-bot pushed a commit that referenced this issue Jul 9, 2022
Summary:
Update HISTORY.md for blob cache.  Implementation can be found from Github issue #10156 (or Github PRs #10155, #10178, #10225, #10198, and #10272).

Pull Request resolved: #10328

Reviewed By: riversand963

Differential Revision: D37732514

Pulled By: gangliao

fbshipit-source-id: 4c942a41c07914bfc8db56a0d3cf4d3e53d5963f
@cavallium

Is it planned to support the blob cache option in rocksdbjni?

@gangliao

@cavallium We currently have an MVP; we will support it in rocksdbjni soon.

facebook-github-bot pushed a commit that referenced this issue Jul 16, 2022
Summary:
RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used. The goals of this task are to add support for using a secondary cache with the blob cache and to measure the potential performance gains using `db_bench`.

This task is a part of #10156

Pull Request resolved: #10349

Reviewed By: ltamasi

Differential Revision: D37896773

Pulled By: gangliao

fbshipit-source-id: 7804619ce4a44b73d9e11ad606640f9385969c84
facebook-github-bot pushed a commit that referenced this issue Jul 17, 2022
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.

This task is a part of #10156

Pull Request resolved: #10298

Reviewed By: ltamasi

Differential Revision: D37908743

Pulled By: gangliao

fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
gangliao added a commit to gangliao/rocksdb that referenced this issue Jul 19, 2022
Summary:

To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.

This PR is a part of facebook#10156
facebook-github-bot pushed a commit that referenced this issue Jul 19, 2022
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.

This PR is a part of #10156

Pull Request resolved: #10321

Reviewed By: ltamasi

Differential Revision: D37913590

Pulled By: gangliao

fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
facebook-github-bot pushed a commit that referenced this issue Jul 28, 2022
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of #10156

Pull Request resolved: #10309

Reviewed By: ltamasi

Differential Revision: D38211655

Pulled By: gangliao

fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
gangliao added a commit to gangliao/rocksdb that referenced this issue Aug 2, 2022
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of facebook#10156
facebook-github-bot pushed a commit that referenced this issue Aug 13, 2022
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of #10156

Pull Request resolved: #10461

Reviewed By: siying

Differential Revision: D38672823

Pulled By: ltamasi

fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743

ltamasi commented Jan 4, 2023

Thanks so much for implementing this feature @gangliao !

@ltamasi ltamasi closed this as completed Jan 4, 2023

gangliao commented Jan 4, 2023

Thank you for your mentorship. :)))

Connor1996 pushed a commit to Connor1996/rocksdb that referenced this issue Jan 29, 2024
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of facebook#10156

Pull Request resolved: facebook#10461

Reviewed By: siying

Differential Revision: D38672823

Pulled By: ltamasi

fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
Connor1996 added a commit to tikv/rocksdb that referenced this issue Feb 1, 2024
* Add a blob-specific cache priority (facebook#10461)

Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of facebook#10156

Pull Request resolved: facebook#10461

Reviewed By: siying

Differential Revision: D38672823

Pulled By: ltamasi

fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743

* make format

Signed-off-by: Connor1996 <[email protected]>

* make format

Signed-off-by: Connor1996 <[email protected]>

---------

Signed-off-by: Connor1996 <[email protected]>
Co-authored-by: Gang Liao <[email protected]>

mo-avatar commented May 21, 2024

@gangliao Will the blob cache automatically be used when we use the traditional Get interface in db.h, or do we have to use GetBlob in db/blob/blob_source.h to get the blob cache to work? Thanks!

@gangliao

@mo-avatar When tackling something new, diving into the unit tests is always a good strategy!

```cpp
Options options = GetDefaultOptions();

// Build a single backing LRU cache shared by the blob cache and block cache.
LRUCacheOptions co;
co.capacity = 2 << 20;  // 2 MB
co.num_shard_bits = 2;
co.metadata_charge_policy = kDontChargeCacheMetadata;
auto backing_cache = NewLRUCache(co);

// Enable BlobDB and point the blob cache at the shared backing cache.
options.enable_blob_files = true;
options.blob_cache = backing_cache;

// Use the same backing cache as the block cache for block-based tables.
BlockBasedTableOptions block_based_options;
block_based_options.no_block_cache = false;
block_based_options.block_cache = backing_cache;
block_based_options.cache_index_and_filter_blocks = true;
options.table_factory.reset(NewBlockBasedTableFactory(block_based_options));
```

@mo-avatar

options.blob_cache = backing_cache;

Thanks for your time and help. I'll read the tests to figure out how it works.
