ltamasi edited this page Mar 31, 2022 · 73 revisions

Overview

BlobDB is essentially RocksDB for large-value use cases. The basic idea, which was proposed in the WiscKey paper, is key-value separation (see the figure below): by storing large values in dedicated blob files and storing only small pointers to them in the LSM tree, we avoid copying the values over and over again during compaction. This reduces write amplification, which has several potential benefits like improved SSD lifetime, and better write and read performance. On the other hand, this comes with the cost of some space amplification due to the presence of blobs that are no longer referenced by the LSM tree, which have to be garbage collected.

Design

⚠️ WARNING: There are two BlobDB implementations in the codebase: the legacy StackableDB-based one (see rocksdb::blob_db::BlobDB) and the new integrated one (which uses the well-known rocksdb::DB interface). The legacy implementation is primarily geared towards FIFO/TTL use cases that can tolerate some data loss. It is incompatible with many widely used RocksDB features (for example, Merge, column families, checkpoints, backup/restore, and transactions), and its performance is significantly worse than that of the integrated implementation. Note that the API for the legacy version is not in the public header directory, that it is not actively developed, and that we expect it to be deprecated eventually. This page focuses on the new integrated BlobDB.

RocksDB's LSM tree works by buffering writes in memtables, which are then written to SST files during flush. SST files form a tree, and are continuously merged and rewritten in the background by compactions:

LSM

BlobDB changes this by writing large values to a dedicated set of blob files during flush/compaction. (Values smaller than a configurable threshold are stored in the LSM tree as usual.) Blobs written to blob files are then accessed via pointers (called blob references) stored in the LSM tree. Note that overwriting or deleting key-values results in unreferenced (garbage) blobs in the blob files. To reclaim this space, BlobDB has garbage collection capabilities.

LSM_with_KV_separation
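
To make the separation concrete, here is a minimal toy model of a flush, not RocksDB code: the types LsmEntry and FlushOutput and the functions Flush and Get are hypothetical stand-ins, and the "blob file" is just a string. It only illustrates that values at or above the threshold are written to the blob file while the tree keeps a small reference (offset and size).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for what the LSM tree stores per key.
struct LsmEntry {
  bool is_blob_ref;          // true: the value lives in the blob file
  std::string inline_value;  // used only when !is_blob_ref
  std::size_t blob_offset;   // blob reference: offset into the blob file
  std::size_t blob_size;     // blob reference: length of the value
};

// Hypothetical result of one flush: an "SST" plus one "blob file".
struct FlushOutput {
  std::map<std::string, LsmEntry> sst;
  std::string blob_file;  // large values, concatenated in key order
};

// Values at or above min_blob_size are written to the blob file;
// smaller values stay inline in the tree.
FlushOutput Flush(const std::map<std::string, std::string>& memtable,
                  std::size_t min_blob_size) {
  FlushOutput out;
  for (const auto& kv : memtable) {
    const std::string& value = kv.second;
    LsmEntry entry;
    if (value.size() >= min_blob_size) {
      entry = LsmEntry{true, std::string(), out.blob_file.size(), value.size()};
      out.blob_file += value;
    } else {
      entry = LsmEntry{false, value, 0, 0};
    }
    out.sst[kv.first] = entry;
  }
  return out;
}

// A read follows the reference into the blob file when needed.
std::string Get(const FlushOutput& flushed, const std::string& key) {
  const LsmEntry& e = flushed.sst.at(key);
  assert(!e.is_blob_ref ||
         e.blob_offset + e.blob_size <= flushed.blob_file.size());
  return e.is_blob_ref ? flushed.blob_file.substr(e.blob_offset, e.blob_size)
                       : e.inline_value;
}
```

In this model a later compaction would only need to copy the small LsmEntry records, which is where the reduction in write amplification comes from.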

Offloading blob file building to RocksDB’s background jobs, i.e. flushes and compactions, has several advantages. It enables BlobDB to provide the same consistency guarantees as RocksDB itself. There are also several performance benefits:

  • Similarly to SSTs, any given blob file is written by a single background thread, which eliminates the need for synchronization.
  • Blob files can be written using large I/Os; there is no need to flush them after each write like in the case of the old BlobDB for example. This approach is also a better fit for network-based file systems where small writes might be expensive.
  • Compressing blobs in the background can improve latencies.
  • Blob files are immutable, which enables making blob files a part of the Version. This in turn makes the read-path essentially lock-free.
  • Similarly to SST files, blob files are sorted by key, which enables performance improvements like using readahead during compaction and iteration.
  • It opens up the possibility of file format optimizations that involve buffering (like dictionary compression).

Features

When it comes to functionality, the new BlobDB is near feature parity with vanilla RocksDB. In particular, it supports the following:

  • write APIs: Put, Merge, Delete, SingleDelete, DeleteRange, Write with all write options
  • read APIs: Get, MultiGet (including batched MultiGet), iterators, and GetMergeOperands
  • flush including atomic and manual flush
  • compaction (with integrated garbage collection), subcompactions, and the manual compaction APIs CompactFiles and CompactRange
  • WAL and the various recovery modes
  • tracking blob files in the MANIFEST
  • snapshots
  • per-blob compression and checksums (CRC32c)
  • column families
  • compaction filters (with a BlobDB-specific optimization)
  • checkpoints
  • backup/restore
  • transactions
  • per-file checksums
  • SST file manager integration for tracking and rate-limited deletion of blob files
  • blob file cache of frequently used blob files
  • statistics
  • DB properties
  • metadata APIs: GetColumnFamilyMetaData, GetAllColumnFamilyMetaData, and GetLiveFilesStorageInfo
  • EventListener interface
  • direct I/O
  • I/O rate limiting
  • I/O tracing
  • C and Java bindings

The BlobDB-specific aspects of some of these features are detailed below.

API

The new BlobDB can be configured (on a per-column family basis if needed) simply by using the following column family options:

  • enable_blob_files: set it to true to enable key-value separation.
  • min_blob_size: values at or above this threshold will be written to blob files during flush or compaction.
  • blob_file_size: the size limit for blob files.
  • blob_compression_type: the compression type to use for blob files. All blobs in the same file are compressed using the same algorithm.
  • enable_blob_garbage_collection: set this to true to make BlobDB actively relocate valid blobs from the oldest blob files as they are encountered during compaction.
  • blob_garbage_collection_age_cutoff: the cutoff that the GC logic uses to determine which blob files should be considered “old.” For example, the default value of 0.25 signals to RocksDB that blobs residing in the oldest 25% of blob files should be relocated by GC. This parameter can be tuned to adjust the trade-off between write amplification and space amplification.
  • blob_garbage_collection_force_threshold: if the ratio of garbage in the oldest blob files exceeds this threshold, targeted compactions are scheduled in order to force garbage collecting the blob files in question, assuming they are all eligible based on the value of blob_garbage_collection_age_cutoff above. This can help reduce space amplification in the case of skewed workloads where the affected files would not otherwise be picked up for compaction. This option is currently only supported with leveled compactions.
  • blob_compaction_readahead_size: when set, BlobDB will prefetch data from blob files in chunks of the configured size during compaction. This can improve compaction performance when the database resides on higher-latency storage like HDDs or remote filesystems.
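
The interplay of the two garbage collection knobs can be sketched with a little arithmetic. This is an illustrative model only: the BlobFileStats struct and both functions are hypothetical, and rounding the cutoff down to a whole number of files is an assumption rather than a documented guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-file summary; files are assumed to be ordered oldest first.
struct BlobFileStats {
  std::uint64_t total_bytes;
  std::uint64_t garbage_bytes;
};

// Number of oldest blob files considered "old" for a given age cutoff.
// Assumption: the fractional result is truncated.
std::size_t NumFilesEligibleForGC(std::size_t num_blob_files,
                                  double age_cutoff) {
  assert(age_cutoff >= 0.0 && age_cutoff <= 1.0);
  return static_cast<std::size_t>(age_cutoff * num_blob_files);
}

// True if the garbage ratio across the eligible (oldest) files exceeds
// force_threshold, i.e. the case in which targeted compactions would be
// scheduled according to the description above.
bool ShouldForceGC(const std::vector<BlobFileStats>& files_oldest_first,
                   double age_cutoff, double force_threshold) {
  const std::size_t n =
      NumFilesEligibleForGC(files_oldest_first.size(), age_cutoff);
  std::uint64_t total = 0;
  std::uint64_t garbage = 0;
  for (std::size_t i = 0; i < n; ++i) {
    total += files_oldest_first[i].total_bytes;
    garbage += files_oldest_first[i].garbage_bytes;
  }
  if (total == 0) return false;  // nothing eligible, nothing to force
  return static_cast<double>(garbage) / static_cast<double>(total) >
         force_threshold;
}
```

For example, with 8 blob files and the default cutoff of 0.25, the two oldest files are candidates; raising the cutoff widens that window, trading more relocation work (write amplification) for less retained garbage (space amplification).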

The above options are all dynamically adjustable via the SetOptions API; changing them will affect subsequent flushes and compactions but not ones that are already in progress.
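
As a rough sketch of a dynamic change, SetOptions takes option names and values as strings, so the request can be assembled as a plain map. The helper name MakeBlobOptions and the chosen values are illustrative assumptions; kLZ4Compression is only usable if the RocksDB build includes LZ4 support.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Builds the name/value strings for enabling BlobDB with garbage collection.
// MakeBlobOptions is a hypothetical helper, not part of RocksDB.
std::unordered_map<std::string, std::string> MakeBlobOptions(
    std::uint64_t min_blob_size, std::uint64_t blob_file_size,
    double gc_age_cutoff) {
  assert(gc_age_cutoff >= 0.0 && gc_age_cutoff <= 1.0);
  return {
      {"enable_blob_files", "true"},
      {"min_blob_size", std::to_string(min_blob_size)},
      {"blob_file_size", std::to_string(blob_file_size)},
      {"blob_compression_type", "kLZ4Compression"},
      {"enable_blob_garbage_collection", "true"},
      {"blob_garbage_collection_age_cutoff", std::to_string(gc_age_cutoff)},
  };
}

// With an open database, the map would be passed along the lines of
//   db->SetOptions(column_family_handle, MakeBlobOptions(4096, 256 << 20, 0.25));
// and would take effect for subsequent flushes and compactions.
```

Keeping the map construction separate from the call makes it easy to log or unit test the exact settings before applying them.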

In terms of compaction styles, we recommend using leveled compaction with BlobDB. The rationale behind universal compaction in general is to provide lower write amplification at the expense of higher read amplification; however, according to our benchmarks, BlobDB can provide very low write amp and good read performance with leveled compaction. Therefore, there is really no reason to take the hit in read performance that comes with universal compaction.

In addition to the above, consider tuning the following non-BlobDB specific options:

  • write_buffer_size: this is the memtable size. You might want to increase it for large-value workloads to ensure that SST and blob files contain a decent number of keys.
  • target_file_size_base: the target size of SST files. Note that even when using BlobDB, it is important to have an LSM tree with a “nice” shape and multiple levels and files per level to prevent heavy compactions. Since BlobDB extracts and writes large values to blob files, it makes sense to make this parameter significantly smaller than the memtable size. One guideline is to set blob_file_size to the same value as write_buffer_size (adjusted for compression if needed) and make target_file_size_base proportionally smaller based on the ratio of key size to value size.
  • max_bytes_for_level_base: consider setting this to a multiple (e.g. 8x or 10x) of target_file_size_base.
  • compaction_readahead_size: this is the readahead size for SST files during compactions. Again, it might make sense to set this when the database is on slower storage.
  • writable_file_max_buffer_size: buffer size used when writing SST and blob files. Increasing it results in larger I/Os, which might be beneficial on certain types of storage.
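
The sizing guideline above can be turned into a small calculation. This is a sketch under simplifying assumptions: SizingHint and SuggestSizes are hypothetical names, compression is ignored, and the size of the blob references stored in the SST files is not accounted for.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the three suggested values.
struct SizingHint {
  std::uint64_t blob_file_size;
  std::uint64_t target_file_size_base;
  std::uint64_t max_bytes_for_level_base;
};

// Derives sizes from the memtable size and the average key/value sizes.
// level_multiple is the factor applied to target_file_size_base (e.g. 8 or 10).
SizingHint SuggestSizes(std::uint64_t write_buffer_size,
                        std::uint64_t avg_key_size,
                        std::uint64_t avg_value_size,
                        std::uint64_t level_multiple) {
  assert(avg_value_size > 0 && level_multiple > 0);
  SizingHint hint;
  // One memtable's worth of large values fills roughly one blob file.
  hint.blob_file_size = write_buffer_size;
  // SST files mostly hold keys, so scale by the key-to-value size ratio.
  hint.target_file_size_base =
      write_buffer_size * avg_key_size / avg_value_size;
  hint.max_bytes_for_level_base = hint.target_file_size_base * level_multiple;
  return hint;
}
```

For instance, a 256 MiB memtable with 32-byte keys and 4 KiB values yields a 256 MiB blob_file_size, a 2 MiB target_file_size_base, and, with a multiple of 8, a 16 MiB max_bytes_for_level_base. Treat these as starting points to validate with benchmarks.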

Compaction filters

As mentioned above, BlobDB now also supports compaction filters. Key-value separation actually enables an optimization here: if the compaction filter of an application can make a decision about a key-value solely based on the key, it is unnecessary to read the value from the blob file. Applications can take advantage of this optimization by implementing the new FilterBlobByKey method of the CompactionFilter interface. This method gets called by RocksDB first whenever it encounters a key-value where the value is stored in a blob file. If this method returns a “final” decision like kKeep, kRemove, kChangeValue, or kRemoveAndSkipUntil, RocksDB will honor that decision; on the other hand, if the method returns kUndetermined, RocksDB will read the blob from the blob file and call FilterV2 with the value in the usual fashion.
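
The call order described above can be modeled without RocksDB. In the sketch below, Decision, ToyFilter, and FilterBlobEntry are simplified stand-ins (the real interface is declared in rocksdb/compaction_filter.h and has richer signatures), and the "tmp/" and "pin/" key prefixes are made up for illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Stand-in for a subset of the compaction filter decisions.
enum class Decision { kKeep, kRemove, kUndetermined };

// Toy filter with a key-only stage and a value-based stage.
struct ToyFilter {
  // Key-only stage: decide without touching the blob file when possible.
  Decision FilterBlobByKey(const std::string& key) const {
    if (key.compare(0, 4, "tmp/") == 0) return Decision::kRemove;
    if (key.compare(0, 4, "pin/") == 0) return Decision::kKeep;
    return Decision::kUndetermined;
  }
  // Value-based stage: here, drop entries whose value is empty.
  Decision FilterV2(const std::string& key, const std::string& value) const {
    (void)key;
    return value.empty() ? Decision::kRemove : Decision::kKeep;
  }
};

// Mimics the order of calls for a key whose value is stored in a blob file:
// read_blob is invoked only if the key-only stage is undetermined.
Decision FilterBlobEntry(const ToyFilter& filter, const std::string& key,
                         const std::function<std::string()>& read_blob) {
  const Decision by_key = filter.FilterBlobByKey(key);
  if (by_key != Decision::kUndetermined) return by_key;
  return filter.FilterV2(key, read_blob());
}
```

The point of the model is the saved I/O: for keys that the first stage can settle, the blob read never happens.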

Statistics

The integrated implementation supports the tickers BLOB_DB_BLOB_FILE_BYTES_{READ,WRITTEN}, BLOB_DB_BLOB_FILE_SYNCED, and BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED, as well as the histograms BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS and BLOB_DB_(DE)COMPRESSION_MICROS. Note that the vast majority of the legacy BlobDB's tickers/histograms are not applicable to the new implementation: they either pertain to calling dedicated BlobDB APIs (which the integrated BlobDB does not have) or are tied to the legacy BlobDB's design of writing blob files synchronously when a write API is called. Such statistics are marked "legacy BlobDB only" in statistics.h.

DB properties

We support the following BlobDB-related properties:

  • rocksdb.num-blob-files: number of blob files in the current Version.
  • rocksdb.blob-stats: returns the total number and size of all blob files, as well as the total amount of garbage (in bytes) in the blob files in the current Version and the corresponding space amplification.
  • rocksdb.total-blob-file-size: the total size of all blob files aggregated across all Versions.
  • rocksdb.live-blob-file-size: the total size of all blob files in the current Version.
  • rocksdb.estimate-live-data-size: this is a non-BlobDB specific property that was extended to also consider the live data bytes residing in blob files (which can be computed exactly by subtracting garbage bytes from total bytes and summing over all blob files in the current Version).
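
The live-data computation mentioned for rocksdb.estimate-live-data-size can be written out as follows. BlobFileSummary and both functions are hypothetical, and defining space amplification as total bytes divided by live bytes is an assumption made for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-file byte counts for the blob files in the current Version.
struct BlobFileSummary {
  std::uint64_t total_blob_bytes;
  std::uint64_t garbage_blob_bytes;
};

// Live bytes: total minus garbage, summed over all blob files.
std::uint64_t LiveBlobBytes(const std::vector<BlobFileSummary>& files) {
  std::uint64_t live = 0;
  for (const BlobFileSummary& f : files) {
    assert(f.garbage_blob_bytes <= f.total_blob_bytes);
    live += f.total_blob_bytes - f.garbage_blob_bytes;
  }
  return live;
}

// Space amplification, assumed here to be total / live.
// Returns 0.0 when there is no live data, to avoid dividing by zero.
double BlobSpaceAmp(const std::vector<BlobFileSummary>& files) {
  std::uint64_t total = 0;
  for (const BlobFileSummary& f : files) total += f.total_blob_bytes;
  const std::uint64_t live = LiveBlobBytes(files);
  if (live == 0) return 0.0;
  return static_cast<double>(total) / static_cast<double>(live);
}
```

A value close to 1 means little garbage is retained; larger values suggest that garbage collection settings may be worth revisiting.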

Metadata APIs

For BlobDB, the ColumnFamilyMetaData structure has been extended with the following information:

  • a vector of BlobMetaData objects, one for each live blob file, which contain the file number, file name and path, file size, total number and size of all blobs in the file, total number and size of all garbage blobs in the file, as well as the file checksum method and checksum value.
  • the total number and size of all live blob files.

This information can be retrieved using the GetColumnFamilyMetaData API for any given column family. You can also retrieve a consistent view of all column families using the GetAllColumnFamilyMetaData API.

EventListener interface

We expose the following BlobDB-related information via the EventListener interface:

  • Job-level information: FlushJobInfo and CompactionJobInfo contain information about the blob files generated by flush and compaction jobs, respectively. Both structures contain a vector of BlobFileInfo objects corresponding to the newly generated blob files; in addition, CompactionJobInfo also contains a vector of BlobFileGarbageInfo structures that describe the additional amount of unreferenced garbage produced by the compaction job in question.
  • File-level information: RocksDB notifies the listener about events related to the lifecycle of any given blob file through the functions OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
  • Operation-level information: the OnFile*Finish notifications are also supported for blob files.

Future work

There are a couple of remaining features that are not yet supported by the new BlobDB: secondary instances and ingestion of blob files. We will continue to work on closing this gap.

We also have further plans when it comes to performance. These include optimizing garbage collection, introducing a dedicated cache for blobs, improving iterator performance, and evolving the blob file format amongst others.
