
feat(storage): support disk object store #2389

Merged: 11 commits from yiming/disk_object_store into main on May 16, 2022

Conversation

@wenym1 wenym1 (Contributor) commented May 9, 2022

What's changed and what's your intention?

In this PR, we implement a disk object store for future spill to disk support. The local file system is used to maintain the path hierarchy and store the object data.

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests

Refer to a related PR or issue link (optional)

part of #2384

@wenym1 wenym1 requested a review from hzxa21 May 9, 2022 10:18
@wenym1 wenym1 self-assigned this May 9, 2022
@codecov codecov bot commented May 9, 2022

Codecov Report

Merging #2389 (e15caf5) into main (965e671) will increase coverage by 0.73%.
The diff coverage is 82.98%.

@@            Coverage Diff             @@
##             main    #2389      +/-   ##
==========================================
+ Coverage   71.31%   72.04%   +0.73%     
==========================================
  Files         688      677      -11     
  Lines       86790    87988    +1198     
==========================================
+ Hits        61897    63394    +1497     
+ Misses      24893    24594     -299     
Flag Coverage Δ
rust 72.04% <82.98%> (+0.73%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/storage/compactor/src/server.rs 0.00% <0.00%> (ø)
src/storage/src/object/mod.rs 13.63% <0.00%> (-1.75%) ⬇️
src/storage/src/object/s3.rs 0.00% <0.00%> (ø)
src/storage/src/store_impl.rs 5.79% <0.00%> (-0.99%) ⬇️
src/storage/src/object/mem.rs 78.21% <42.85%> (-0.14%) ⬇️
src/storage/src/hummock/sstable_store.rs 65.92% <50.00%> (+4.85%) ⬆️
src/storage/src/hummock/block_cache.rs 80.59% <55.55%> (+7.73%) ⬆️
src/storage/src/hummock/cache.rs 95.78% <88.00%> (+0.27%) ⬆️
src/storage/src/object/disk.rs 94.76% <94.76%> (ø)
src/storage/src/object/error.rs 38.09% <100.00%> (+15.87%) ⬆️
... and 216 more


@skyzh skyzh (Contributor) commented May 9, 2022

Do we have plans to support multiple object store backends for Hummock? If we want to spill to disk, I guess we are running two backends. I'm also wondering whether disk should be an object store. Maybe we only need a disk interface for spill-to-disk, without making it a kind of object store?

@wenym1 wenym1 (Contributor, Author) commented May 10, 2022

Do we have plans to support multiple object store backends for Hummock? If we want to spill to disk, I guess we are running two backends. I'm also wondering whether disk should be an object store.

Yes, in my development plan, we will have two object stores, one local and one remote. The sstable store will hold both and route requests according to the highest bit of the sstable id. A more complete PR is #2384; since it's too large, I split it into several smaller PRs like this one.

Maybe we only need a disk interface for spill-to-disk, without making it a kind of object store?

Since the spilled files will be in SST format, making it an object store lets us share some logic when we read SSTs and keeps the code neater. In a future PR I will introduce a stream-like uploader for the object store so that we don't have to buffer the whole SST in memory when using the disk object store.
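For illustration, a minimal sketch of the routing idea described above (the flag constant, the field names, and the placeholder ObjectStore trait are assumptions made for this sketch; the real layout lands in #2384):

use std::sync::Arc;

// Placeholder for the real trait in src/storage/src/object/mod.rs.
trait ObjectStore: Send + Sync {}

// Hypothetical flag: the highest bit of the u64 sstable id marks a local SST.
const LOCAL_SST_FLAG: u64 = 1 << 63;

struct SstableStore {
    local_store: Arc<dyn ObjectStore>,
    remote_store: Arc<dyn ObjectStore>,
}

impl SstableStore {
    // Route a request to the local or remote store based on the sstable id.
    fn store_for(&self, sst_id: u64) -> &dyn ObjectStore {
        if sst_id & LOCAL_SST_FLAG != 0 {
            self.local_store.as_ref()
        } else {
            self.remote_store.as_ref()
        }
    }
}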

@skyzh skyzh (Contributor) commented May 10, 2022

we will have two object stores

Looks good!

use crate::object::{ObjectError, ObjectResult};

pub async fn ensure_file_dir_exists(path: &Path) -> ObjectResult<()> {
    if let Some(dir) = path.parent() {
Contributor:

Should we also assert that path is not already a folder?

Contributor Author:

I think it's not necessary. If the current path is already a folder, the subsequent open will fail.

    Ok(())
}

async fn read(&self, path: &str, block_loc: Option<BlockLocation>) -> ObjectResult<Bytes> {
Contributor:

I guess this would be very inefficient for a disk object store, and would push the number of open file handles very high, which could easily exceed the file handle limit on most platforms (e.g. 256 on macOS).

For the disk object store, I would recommend keeping a cache of opened files.

Contributor Author:

Do you mean that after we finish using a file handle, we recycle it in a cache instead of closing it?

Contributor:

Yes. Also, we should use a single file handle (file object) per file and avoid wrapping it in a mutex. If there are multiple requests to a file, we should use pread to do positioned reads.

Contributor:

+1.
But it seems that tokio does not support pread.
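For context on the pread suggestion (and anticipating the workaround the author describes in the May 13 comment below): std exposes positioned reads via FileExt::read_exact_at, which takes &File and so needs no mutex, and spawn_blocking makes it usable from async code. A rough sketch, not the PR's actual code:

use std::fs::File;
use std::os::unix::fs::FileExt;
use std::sync::Arc;

// Positioned read on a shared handle: read_exact_at takes &self, so concurrent
// readers can share one File via Arc without a mutex.
async fn pread(file: Arc<File>, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        let mut buf = vec![0u8; len];
        file.read_exact_at(&mut buf, offset)?;
        Ok(buf)
    })
    .await
    .expect("blocking read task panicked")
}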

@@ -117,6 +121,15 @@ impl StateStoreImpl {
        state_store_stats.clone(),
    );
    in_mem_object_store.set_compactor_shutdown_sender(shutdown_sender);
} else if let ObjectStoreImpl::Disk(disk_object_store) = object_store.as_ref() {
    tracing::info!("start a compactor for local disk object store");
    let (_, shutdown_sender) = Compactor::start_compactor(
Contributor:

Shall we limit the capability of this compactor? e.g., only allow it to compact files on local disk.

Contributor Author:

The compactor is unaware of where a file comes from. It just fetches the table id from meta and the data from the object store. The code here only covers the case of possibly using the disk object store as the remote ground-truth store, similar to using the in-memory object store.

use std::fs::Metadata;
use std::path::Path;

use tokio::fs::{create_dir_all, File, OpenOptions};
Contributor:

Could we have a benchmark comparing tokio::fs and std/unix fs, for both bandwidth and latency?

Contributor Author:

Just added a simple bench. The bench runs serially in a single thread. The results are as follows; each read/write operates on 1 MB of data.

fs_operation/tokio      time:   [124.42 ms 126.53 ms 128.76 ms]
tokio stat: Write 1310 times, avg 5.948854961832061ms. Read 1310 times, avg 4.315267175572519ms

fs_operation/std        time:   [27.044 ms 27.805 ms 28.629 ms]
std stat: Write 3270 times, avg 0.0027522935779816515ms. Read 3270 times, avg 0.046788990825688076ms

Contributor:

Do we have plans to support multiple object store backends for Hummock? If we want to spill to disk, I guess we are running two backends. I'm also wondering whether disk should be an object store.

Yes, in my development plan, we will have two object stores, one local and one remote. The sstable store will hold both and route requests according to the highest bit of the sstable id. A more complete PR is #2384; since it's too large, I split it into several smaller PRs like this one.

Maybe we only need a disk interface for spill-to-disk, without making it a kind of object store?

Since the spilled files will be in SST format, making it an object store lets us share some logic when we read SSTs and keeps the code neater. In a future PR I will introduce a stream-like uploader for the object store so that we don't have to buffer the whole SST in memory when using the disk object store.

Would we implement it as a hybrid storage? Something like the following:

pub struct HybridObjectStore {
    local: Box<dyn ObjectStore>,
    remote: Box<dyn ObjectStore>,
}

impl ObjectStore for HybridObjectStore {
}

Contributor Author:

I have thought about it. If we implement it this way, I can't find an elegant way to specify which object store to use within the current ObjectStore interface. In the future, such a HybridObjectStore could be used with cache semantics, i.e. we first look up the local object store, which acts as a local cache, and on a cache miss we then look up the remote object store.
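For illustration, a rough sketch of that cache-semantics read path (the trait and types here are simplified stand-ins for the crate's ObjectStore, ObjectResult, and BlockLocation, using the async-trait crate; treating any local error as a miss is a shortcut for brevity):

use bytes::Bytes;

// Simplified stand-ins for the crate's types, just for this sketch.
type ObjectResult<T> = Result<T, std::io::Error>;

#[derive(Clone, Copy)]
pub struct BlockLocation {
    pub offset: usize,
    pub size: usize,
}

#[async_trait::async_trait]
pub trait ObjectStore: Send + Sync {
    async fn read(&self, path: &str, block_loc: Option<BlockLocation>) -> ObjectResult<Bytes>;
}

pub struct HybridObjectStore {
    local: Box<dyn ObjectStore>,
    remote: Box<dyn ObjectStore>,
}

impl HybridObjectStore {
    // Cache semantics: look up the local store first; on a miss, fall back to remote.
    // A real implementation would distinguish "not found" from other errors and
    // back-fill the local copy after a remote hit.
    pub async fn read(&self, path: &str, block_loc: Option<BlockLocation>) -> ObjectResult<Bytes> {
        match self.local.read(path, block_loc).await {
            Ok(bytes) => Ok(bytes),
            Err(_) => self.remote.read(path, block_loc).await,
        }
    }
}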

Comment on lines +938 to +942
#[cfg(debug_assertions)]
{
    assert!(!(*old_entry).is_in_lru());
    assert!((*new_entry).is_in_lru());
}
Member:

Use debug_assert here?

Contributor Author:

is_in_lru only compiles when debug_assertions is enabled, while the code generated by debug_assert! is if cfg!(debug_assertions) { ... }, whose body is still compiled even when debug_assertions is disabled. Therefore, if we use debug_assert! here we will get a compile error.

Contributor:

ditto

src/storage/src/object/disk.rs (outdated; resolved)
src/storage/src/object/disk.rs (outdated; resolved)
@BugenZhao BugenZhao requested a review from MrCroxx May 12, 2022 05:40
@wenym1 wenym1 (Contributor, Author) commented May 12, 2022

Just updated the benchmark. Added benchmarking with spawn_blocking(std_file.read/pread) and block-in-place pread.

The new benchmark results are as follows:

fs_operation/tokio/write    time:   [44.744 ms 45.613 ms 46.502 ms]
Bench tokio write: op 3270 times, avg time: 4.022018348623853 ms

fs_operation/tokio/read time:   [57.975 ms 59.133 ms 60.315 ms]
Bench tokio read: op 1630 times, avg time: 4.62760736196319 ms

fs_operation/tokio/blocking-read   time:   [14.325 ms 14.435 ms 14.547 ms]
Bench tokio blocking-read: op 6550 times, avg time: 0.35404580152671755 ms

fs_operation/tokio/blocking-pread     time:   [14.186 ms 14.347 ms 14.525 ms]
Bench tokio blocking-pread: op 6550 times, avg time: 0.32916030534351143 ms

fs_operation/std/write  time:   [10.288 ms 10.329 ms 10.409 ms] 
Bench std write: op 10110 times, avg time: 0.0016815034619188922 ms

fs_operation/std/read   time:   [11.364 ms 11.440 ms 11.521 ms]
Bench std read: op 10110 times, avg time: 0.004055390702274975 ms

fs_operation/std/pread  time:   [11.649 ms 11.727 ms 11.813 ms]
Bench std pread: op 10110 times, avg time: 0.001582591493570722 ms
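The benchmarks above compare spawn_blocking-based reads with a block-in-place pread; for reference, the block_in_place form looks roughly like this sketch (it only works on tokio's multi-threaded runtime; the function name is illustrative):

use std::fs::File;
use std::os::unix::fs::FileExt;

// Run the positioned read directly on the current worker thread, telling the
// runtime it may block. Panics if used on a current-thread runtime.
fn pread_block_in_place(file: &File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    tokio::task::block_in_place(|| {
        let mut buf = vec![0u8; len];
        file.read_exact_at(&mut buf, offset)?;
        Ok(buf)
    })
}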

@wenym1 wenym1 (Contributor, Author) commented May 13, 2022

Just added an opened-file cache. Used spawn_blocking as a workaround to call file.read_exact_at.

Also extracted the common logic of deduplicating concurrent requests for a cache entry into cache.lookup_with_request_dedup. The block cache and meta cache, which used cache.lookup_for_request, are refactored accordingly.
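For readers unfamiliar with the pattern, this is roughly what deduplicating concurrent requests means; the type below is purely illustrative and is not the cache's actual lookup_with_request_dedup implementation:

use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;

use tokio::sync::broadcast;

// Illustrative only: ensure that, for a given key, only one caller performs the
// expensive fetch while concurrent callers wait for that result.
pub struct RequestDedup<K, V> {
    pending: Mutex<HashMap<K, broadcast::Sender<V>>>,
}

impl<K, V> RequestDedup<K, V>
where
    K: Hash + Eq + Clone,
    V: Clone,
{
    pub async fn get_or_fetch<F, Fut>(&self, key: K, fetch: F) -> V
    where
        F: FnOnce() -> Fut,
        Fut: std::future::Future<Output = V>,
    {
        // Either register ourselves as the fetcher or subscribe to the in-flight one.
        let mut wait_rx = {
            let mut pending = self.pending.lock().unwrap();
            match pending.get(&key) {
                Some(tx) => Some(tx.subscribe()),
                None => {
                    let (tx, _) = broadcast::channel(1);
                    pending.insert(key.clone(), tx);
                    None
                }
            }
        };
        if let Some(rx) = wait_rx.as_mut() {
            // Another caller is fetching this key: wait for its result.
            return rx.recv().await.expect("fetcher dropped without sending");
        }
        // We are the first requester: run the fetch and broadcast the result.
        let value = fetch().await;
        let tx = self.pending.lock().unwrap().remove(&key).expect("pending entry");
        let _ = tx.send(value.clone());
        value
    }
}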

Comment on lines +961 to 962
#[cfg(debug_assertions)]
assert!((*new_entry).is_in_lru());
Contributor:

Suggested change:
- #[cfg(debug_assertions)]
- assert!((*new_entry).is_in_lru());
+ debug_assert!((*new_entry).is_in_lru());

I guess this will work.

Contributor Author:

It won't work if we run cargo bench, where the unit tests are compiled with debug_assertions disabled.

The code of is_in_lru only exists when debug_assertions is enabled.

#[cfg(debug_assertions)]
fn is_in_lru(&self) -> bool {
    (self.flags & IN_LRU) > 0
}

while the code generated by debug_assert! is:

if cfg!(debug_assertions) {
 ...
}

and the use of is_in_lru still exists even when debug_assertions is not enabled, so when running cargo bench the compilation will fail.

Contributor:

Okay, I originally expected that debug_assert! is equivalent to #[cfg(debug_assertions)].


@skyzh skyzh (Contributor) left a comment

Rest LGTM, good work!

src/storage/src/hummock/cache.rs (resolved)
match self.lookup_for_request(hash, key.clone()) {
    LookupResult::Cached(entry) => Ok(entry),
    LookupResult::WaitPendingRequest(recv) => {
        let entry = recv.await.map_err(HummockError::other)?;
Contributor:

Better to return the original error to the caller, instead of "channel closed" or something equally unclear. Here we can only get a channel-closed error.

Contributor:

Previously, moka would return Arc<Error> to all callers.

Contributor:

The original error may not support Clone?
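One way to square these two points (a sketch of the idea, not the PR's code): have the fetching task wrap its error in Arc before forwarding the result to the deduplicated waiters, so everyone sees the original error even if it doesn't implement Clone. The types below are illustrative placeholders.

use std::sync::Arc;

use tokio::sync::oneshot;

// Illustrative placeholder types; the real cache has its own entry and error types.
type Entry = Arc<Vec<u8>>;
#[derive(Debug)]
pub struct FetchError(pub String);

// The task that performed the fetch forwards its result to every waiter that was
// deduplicated onto it. Arc makes the (non-Clone) error shareable across waiters.
pub fn notify_waiters(
    waiters: Vec<oneshot::Sender<Result<Entry, Arc<FetchError>>>>,
    fetch_result: Result<Entry, FetchError>,
) {
    let shared: Result<Entry, Arc<FetchError>> = fetch_result.map_err(Arc::new);
    for tx in waiters {
        // Cloning a Result of two Arcs is cheap.
        let _ = tx.send(shared.clone());
    }
}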


}
}

async fn readv(&self, path: &str, block_locs: Vec<BlockLocation>) -> ObjectResult<Vec<Bytes>> {
Contributor:

Use block_locs: impl AsRef<[BlockLocation]>, so that callers can pass either a reference &[BlockLocation { xxx }] or a vector.

impl Drop for InMemObjectStore {
    fn drop(&mut self) {
        if let Some(sender) = self.compactor_shutdown_sender.lock().take() {
            let _ = sender.send(());
Contributor:

Better to join the compactor thread (future); otherwise there will be unexpected errors on SIGINT.

Contributor:

Why notify the compactor to shut down in the object store? We should notify compactor shutdown when the state store is closed.
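A sketch of the "join the compactor" suggestion above (the handle and signal parameters are illustrative, not the actual return values of start_compactor):

use tokio::sync::oneshot;
use tokio::task::JoinHandle;

// Illustrative shutdown sequence: signal the compactor, then await its task so it
// has fully stopped before the process exits (e.g. on SIGINT).
pub async fn shutdown_compactor(shutdown_tx: oneshot::Sender<()>, compactor: JoinHandle<()>) {
    let _ = shutdown_tx.send(());
    if let Err(e) = compactor.await {
        tracing::warn!("compactor task exited abnormally: {}", e);
    }
}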

@Little-Wallace Little-Wallace (Contributor) left a comment

LGTM

@wenym1 wenym1 merged commit 10cce2e into main May 16, 2022
@wenym1 wenym1 deleted the yiming/disk_object_store branch May 16, 2022 08:12