This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

statedb: allow longer state pruning history #11980

Merged
merged 21 commits on Sep 1, 2022

Conversation

@NingLin-P (Contributor) commented Aug 4, 2022

Fixes #11911

This PR introduces a database-backed queue for the state pruning window, which only keeps a few blocks in memory and loads more blocks on demand, thus reducing the burden of setting a long state pruning history.

Note that the queue requires the backend database to support reference counting (i.e. parity-db); databases that do not support reference counting (i.e. rocksdb) still need to keep all blocks in memory.
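Roughly, the shape of the queue is the following (a minimal sketch of the idea only; the type and field names approximate the PR rather than quoting it):

```rust
use std::collections::VecDeque;

type BlockHash = [u8; 32];
type Key = Vec<u8>;

/// One journal record: everything needed to prune a single block.
struct DeathRow {
    hash: BlockHash,
    deleted: Vec<Key>,
}

/// The pruning window. Only a bounded slice of blocks at the front is kept in
/// memory; the rest stays in the database journal and is loaded on demand
/// shortly before it is pruned.
struct DbBackedQueue {
    /// Number of the next block to be pruned (front of the window).
    base: u64,
    /// In-memory cache of the next blocks to prune, in order.
    cache: VecDeque<DeathRow>,
    /// Blocks that are inside the window but not loaded into `cache` yet.
    uncached_blocks: usize,
}

impl DbBackedQueue {
    /// Total number of blocks currently inside the pruning window.
    fn window_size(&self) -> u64 {
        (self.cache.len() + self.uncached_blocks) as u64
    }
}
```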

polkadot address: 1CWyso9Zw46QdR54sTvKJXTUdmZznYZa2aJfNHdG6xQ6L71

@cla-bot-2021 (bot) commented Aug 4, 2022

User @NingLin-P, please sign the CLA here.

@NingLin-P (Contributor, Author)

PTAL @arkpar

Signed-off-by: linning <[email protected]>
@arkpar (Member) commented Aug 5, 2022

Thank you for the PR! Consider adding your polkadot address in the format described here.

For now I only took a basic look and it looks mostly good. There's just one issue. I don't think there's any real need to maintain DbBacked::hashs. The problem with hashs is that while the memory requirement is much smaller, it is still O(n). The bigger problem is that it requires loading all journal records on startup, which can take a long time with millions of records.

We can always get the next hash to prune from the database by loading the next journal record. For next_hash to work, we can preload the next record. So instead of hashs we can maintain next, a preloaded JournalRecord for the next block to prune. When it is time to prune block N, we just use the data from next and then preload the next block (N + 1) from the database into next.
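Something along these lines (a hedged sketch of the suggestion; `load_journal_record` is a hypothetical helper standing in for the actual journal read, not an existing API):

```rust
type BlockHash = [u8; 32];
type Key = Vec<u8>;

struct JournalRecord {
    hash: BlockHash,
    deleted: Vec<Key>,
}

/// Hypothetical: read and decode the pruning-journal entry for `number`,
/// returning `None` when it is not in the database.
fn load_journal_record(_number: u64) -> Option<JournalRecord> {
    None // placeholder for the real database read
}

struct Pruner {
    /// Number of the next block to prune.
    base: u64,
    /// Preloaded journal record for block `base`, so `next_hash` needs no db read.
    next: Option<JournalRecord>,
}

impl Pruner {
    fn next_hash(&self) -> Option<BlockHash> {
        self.next.as_ref().map(|record| record.hash)
    }

    /// Prune block `base` using the preloaded record, then preload `base + 1`.
    fn prune_one(&mut self) -> Option<JournalRecord> {
        let record = self.next.take()?;
        self.base += 1;
        self.next = load_journal_record(self.base);
        Some(record)
    }
}
```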

@NingLin-P (Contributor, Author)

There's just one issue. I don't think there's any real need to maintain DbBacked::hashs. The problem with hashs is that while the memory requirement is much smaller, it is still O(n). The bigger problem is that it requires loading all journal records on startup, which can take a long time with millions of records.

Indeed, DbBacked::hashs still has these problems, but the real need to maintain DbBacked::hashs is RefWindow::have_block (we should replace hashs: VecDeque with a BTreeSet, though).

I have also checked the upper callers that require RefWindow::have_block, but it seems hard to remove DbBacked::hashs without another way to determine whether a block is canonicalized (then we could implement have_block just by checking the block number).

@arkpar (Member) commented Aug 5, 2022

I have also checked the upper callers that require RefWindow::have_block, but it seems hard to remove DbBacked::hashs without another way to determine whether a block is canonicalized (then we could implement have_block just by checking the block number)

All the blocks in RefWindow are guaranteed to be canonicalized. Blocks that are not yet canonicalized are handled by NonCanonicalOverlay. So that's not an issue.

have_block is currently used in:

  • is_pruned - we can check by number here.
  • pin - it is fine to just remove the check and always add to the pinned set.

@NingLin-P (Contributor, Author)

All the blocks in RefWindow are guaranteed to be canonicalized

But the block in question may not be canonicalized (like the parameter passed to is_pruned); the call chain of is_pruned involves many functions of the sc_client_api::backend::Backend trait implemented for sc_client_db::Backend (e.g. Backend::state_at).

@arkpar (Member) commented Aug 5, 2022

is_pruned and pin already check whether the block is in the non-canonicalized overlay by calling NonCanonicalOverlay::have_block. Not to be confused with RefWindow::have_block; that's a different have_block function and there's no need to change it.

The block passed to is_pruned, or rather have_state_at, may be in one of the following states:

  1. Pending canonicalization - it should be in NonCanonicalOverlay and is_pruned will return it here:
    !self.non_canonical.have_block(hash)
  2. Pending pruning. This may be checked by number as long as the header is known.
  3. Already pruned. This is the case where the header is in the db, but the number is out of the pruning window.
  4. Unknown, as in it was never inserted. This is when the header is missing.

So I suppose we'd need to document that is_pruned and pin should only be called for blocks with existing headers, add an additional check that the header exists in have_state_at, and it should be fine.

For additional safety I'd add a number argument to pin and check that the number is in the ref window there as well.
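Put together, the check being described could look roughly like this (a sketch of the decision flow under the assumptions above; the helper functions are placeholders, not the actual sc-client-db API):

```rust
type BlockHash = [u8; 32];

fn header_exists(_hash: &BlockHash) -> bool { false }            // placeholder
fn in_non_canonical_overlay(_hash: &BlockHash) -> bool { false } // placeholder
fn pruning_window_start() -> u64 { 0 }                           // placeholder

fn have_state_at(hash: &BlockHash, number: u64) -> bool {
    // 4. Unknown: the header was never inserted.
    if !header_exists(hash) {
        return false
    }
    // 1. Pending canonicalization: handled by the non-canonical overlay.
    if in_non_canonical_overlay(hash) {
        return true
    }
    // 2./3. Pending pruning vs. already pruned: with the header known,
    // the number alone tells whether the block is still inside the window.
    number >= pruning_window_start()
}
```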

@NingLin-P (Contributor, Author)

@arkpar Thanks for the explanation! I have changed the implementation, but two tests of is_pruned are broken; I will fix them later.

@NingLin-P (Contributor, Author)

PTAL again @arkpar

@arkpar (Member) commented Aug 8, 2022

After some consideration I can see that is_pruned/pin is still a bit problematic, considering that the caller indeed does not know whether it is the hash or the number that is taken into account. I think we can fix this with the following:

pin should return an enum

  • Pinned when the given hash and number are definitely known to the state DB. This is when the block is pending canonicalization, or when the hash is in the in-mem backend for RefWindow or in the cache.
  • MaybePinned when the block with the given number is in the pruning window, but we can't verify that the hash matches. MaybePinned should also be returned when archive mode is enabled. When this is returned, the backend in sc-client-db should do an additional check for the state root, like this:

    substrate/client/db/src/lib.rs

    Lines 2286 to 2290 in d042666

    Ok(header) => sp_state_machine::Storage::get(
        self.storage.as_ref(),
        &header.state_root,
        (&[], None),
    )
  • Err(..) when we know that the block definitely does not exist.

is_pruned should have a similar enum return value with a similar check.

The call to have_state_at before pin should be removed. This use case is racy anyway: you can't do anything useful with the value returned by have_state_at, since some other thread may cause pruning to happen and the result is obsolete right after the call. The only use for have_state_at should be cases where the result being up to date is not that important, e.g. when showing block status to the user.
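As a rough Rust sketch of what is being proposed here (illustrative names only; as the later comments show, the merged code handles the MaybePinned case differently):

```rust
type BlockHash = [u8; 32];

enum PinResult {
    /// The given hash and number are definitely known to the state DB.
    Pinned,
    /// The number is inside the pruning window but the hash could not be
    /// verified; the caller does an extra state-root lookup and calls
    /// `unpin` if that check fails.
    MaybePinned,
}

enum PinError {
    /// The block definitely does not exist.
    InvalidBlock,
}

fn pin(_hash: &BlockHash, _number: u64) -> Result<PinResult, PinError> {
    // Placeholder: a real implementation would consult the non-canonical
    // overlay, the in-memory part of RefWindow, and the window bounds.
    Ok(PinResult::MaybePinned)
}
```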

Signed-off-by: linning <[email protected]>
@NingLin-P (Contributor, Author) commented Aug 9, 2022

Comments addressed

pin is implemented a bit differently: pin needs to pin the block according to the have_block result immediately, so we can't return a MaybePinned to the caller to do further checking and then pin again. Instead I pass a closure to pin, which will do the further checking inside pin if needed.
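A sketch of the closure-based variant being described (the signature is an approximation, not a verbatim quote of the PR):

```rust
type BlockHash = [u8; 32];

enum PinError {
    InvalidBlock,
}

struct StateDbSketch;

impl StateDbSketch {
    /// When only the block number can be verified, `hint` is invoked (e.g. to
    /// check that the state root exists in the db) before the block is
    /// actually added to the pinned set.
    fn pin<F: Fn() -> bool>(
        &mut self,
        _hash: &BlockHash,
        _number: u64,
        hint: F,
    ) -> Result<(), PinError> {
        // Placeholder body: the real code only calls `hint()` in the
        // "maybe have block" case.
        if hint() {
            Ok(())
        } else {
            Err(PinError::InvalidBlock)
        }
    }
}
```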

@NingLin-P (Contributor, Author)

Perhaps we can handle the MayHaveBlock case in a simpler way. Currently, with the block number passed to RefWindow::have_block, we can load the exact single journal record from the db and check whether its hash matches the hash parameter (thanks to the block number, we don't need to iterate over the whole queue to find the hash as the previous implementation did).

Since sp_state_machine::Storage::get may also result in a db read, this approach does not cost much more but is much simpler.
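A sketch of that by-number lookup (`journal_hash_at` is a hypothetical helper for the single-record db read, and the enum variants are illustrative):

```rust
type BlockHash = [u8; 32];

enum HaveBlock {
    No,
    Have,
}

/// Hypothetical: load the journal record at `number` and return its hash,
/// or `None` if the record is missing.
fn journal_hash_at(_number: u64) -> Option<BlockHash> {
    None // placeholder for the real database read
}

/// With the block number known, only the single record at `number` has to be
/// read and compared, instead of scanning the whole window for the hash.
fn have_block(hash: &BlockHash, number: u64, base: u64, window_size: u64) -> HaveBlock {
    if number < base || number >= base + window_size {
        return HaveBlock::No
    }
    match journal_hash_at(number) {
        Some(recorded) if &recorded == hash => HaveBlock::Have,
        _ => HaveBlock::No,
    }
}
```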

@arkpar (Member) commented Aug 9, 2022

pin is implemented a bit differently: pin needs to pin the block according to the have_block result immediately, so we can't return a MaybePinned to the caller to do further checking and then pin again. Instead I pass a closure to pin, which will do the further checking inside pin if needed.

I was imagining that MaybePinned actually places the pin, and if the additional check fails, unpin should be called. But a closure works as well. pin should really be made to return a guard, but that's out of scope for this PR.

Perhaps we can handle the MayHaveBlock case in a simpler way. Currently, with the block number passed to RefWindow::have_block, we can load the exact single journal record from the db and check whether its hash matches the hash parameter (thanks to the block number, we don't need to iterate over the whole queue to find the hash as the previous implementation did).

The difference is how much data needs to be queried from the DB. A single root trie node is small, but the journal record may contain a lot of keys. An alternative storage scheme would be more efficient, but I don't think it is worth the effort to change it now.

@arkpar requested a review from bkchr on August 9, 2022, 18:35
@cheme (Contributor) commented Aug 10, 2022

Actually, the pruning window is a FIFO queue now: blocks are pushed to the back (import) and popped from the head (prune_one and apply_pending). The cache is only a slice of consecutive blocks at the head of the window (it is also a FIFO queue), to reduce the burden of keeping all blocks in memory and thus allow a longer state pruning window.

Yes, FIFO was the wrong wording from me; what I meant was that in your code you give priority to the first-out block over the first-in one; the opposite could be done. The fact that we force a consecutive range means we keep emptying the back of the VecDeque, batch-refilling it, emptying it again, and so on, so one idea would be to just keep the index of each block and not force a consecutive range (avoiding the batched queries this way). But that only matters when the pruning range doesn't fit in memory.

@NingLin-P (Contributor, Author)

so one idea would be to just keep the index of each block and not force a consecutive range (avoiding the batched queries this way). But that only matters when the pruning range doesn't fit in memory.

Oh, the code emphasizes consecutiveness a lot because blocks are pruned in consecutive order. The pruning window is a FIFO queue: a newly canonicalized block is pushed to the back of the queue, and when the queue's size exceeds the configured size, blocks at the head are popped and pruned until the size is <= the configured size again. Since the blocks in the pruning window are only used for pruning, we only load a block into the cache when it is about to be pruned.
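As a sketch of that flow (numbers and names illustrative only, not the PR's code):

```rust
use std::collections::VecDeque;

/// Push a newly canonicalized block onto the back of the window, then pop and
/// prune blocks from the head until the window fits `max_blocks` again.
fn enforce_window(window: &mut VecDeque<u64>, new_block: u64, max_blocks: usize) -> Vec<u64> {
    window.push_back(new_block);
    let mut pruned = Vec::new();
    while window.len() > max_blocks {
        // Head of the queue = oldest canonicalized block = next to prune.
        if let Some(oldest) = window.pop_front() {
            pruned.push(oldest);
        }
    }
    pruned
}

fn main() {
    let mut window: VecDeque<u64> = (0..256).collect();
    // Importing block 256 with a window of 256 prunes block 0.
    assert_eq!(enforce_window(&mut window, 256, 256), vec![0]);
}
```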

@cheme (Contributor) commented Aug 11, 2022

I think I looked in the wrong direction; what matters is to have one for the case where the window is at the default size (I believe it is 256, from

Self { max_blocks: Some(256), max_mem: None }

, which should be exposed and reused as a constant).
I was looking at what happens when the window is bigger than the cache, and in that case just not having the active caching is also fine.

const PRUNING_JOURNAL: &[u8] = b"pruning_journal";
// Default pruning window size plus a magic number to keep most common ops in cache
const CACHE_BATCH_SIZE: usize = 256 + 10;
@cheme (Contributor) commented Aug 11, 2022

We could expose 256 as another constant.
Also, this size could be calculated from the actual constraint value in state-db/src/lib.rs rather than being a constant.
Also, if the state-db has a constraint > 256 (the --pruning-size param), we could then just disable the cache and always fetch from the db on pop_front.
This is an alternative; it just saves some cache memory and avoids the batch download.

@NingLin-P (Contributor, Author)

We could expose 256 as another constant.
Also, this size could be calculated from the actual constraint value in state-db/src/lib.rs rather than being a constant.

Yes, we can calculate this cache size from the default window size, but overall we still need a constant to determine the max cache size.

Also, if the state-db has a constraint > 256 (the --pruning-size param), we could just disable the cache and always fetch from the db on pop_front.
This is an alternative; it just saves some cache memory and avoids the batch download.

Having a cache will be helpful. Since all blocks eventually need to be pruned (and thus need to be loaded from the db if the window size is larger than the max cache size), batch loading doesn't bring more cost.

And the cache may be more friendly to the underlying caching (i.e. page cache): as the cache items are small (a block hash plus deleted keys, which are also hashes I think), multiple items can fit in the same page. Also, having the cache avoids re-loading blocks in case of revert_pending, and it makes the code cleaner as we don't need to add the constraint > 256 condition everywhere.

@cheme (Contributor)

Having a cache will be helpful. Since all blocks eventually need to be pruned (and thus need to be loaded from the db if the window size is larger than the max cache size), batch loading doesn't bring more cost.

The point of not having it is mainly to reduce code (maybe the query of 256 items would not be noticeable indeed).
As mentioned before, my opinion (which may not be correct) on the cache for pruning windows > cache size would be to use a sparse array (so not all items would have to be loaded/read from the db). But this is really not a concern, since the first goal of this cache seems to be to maintain similar perf when using a reasonable pruning window.

Signed-off-by: linning <[email protected]>
@cheme (Contributor) left a comment

I think this is my last batch of comments 👍 (I give up on changing your mind about the queue design)

// NOTE: `pop_front` should not return `MetaDb::Error` because blocks are visited
// by `RefWindow::prune_one` first then `RefWindow::apply_pending` and
// `DeathRowQueue::get` should load the blocks into cache already
.expect("block must loaded in cache thus no MetaDb::Error")
@cheme (Contributor)

The NOTE does not make much sense to me; for me, we just do not expect to have a missing journal, nor the queue (cached and uncached) to be empty, if pending_pruning is 0 (and we do not expect pending_pruning to be out of the queue, but that is the next expectation).

@NingLin-P (Contributor, Author)

pop_front returns Result<Option<DeathRow<BlockHash, Key>>, Error<D::Error>>. The first expect is for the potential loading from the db, which brings a potential Error<D::Error>; this is unexpected because there should be no db loading for pop_front in the current usage, as the comment explains. The second expect is for the case you mentioned.

Comment on lines +502 to +507

if self.base == 0 && self.queue.len() == 0 && number > 0 {
    // assume that parent was canonicalized
    self.base = number;
} else if (self.base + self.queue.len() as u64) != number {
    return Err(Error::StateDb(StateDbError::InvalidBlockNumber))
}
@NingLin-P (Contributor, Author)

This change follows from:

if self.levels.is_empty() && self.last_canonicalized.is_none() && number > 0 {
    // assume that parent was canonicalized
    let last_canonicalized = (parent_hash.clone(), number - 1);
    commit
        .meta
        .inserted
        .push((to_meta_key(LAST_CANONICAL, &()), last_canonicalized.encode()));
    self.last_canonicalized = Some(last_canonicalized);

namely that the first block put into the pruning window may not be block 0. So in the previous implementation the block numbers in the pruning window were incorrect, but it was okay because its public functions didn't take any block number as a parameter (perhaps this is why the previous have_block needed to brute-force search all blocks).

But this PR optimizes have_block by using the block number to index the queried block, thus causing the syncs_state test to fail, so we need to add this change to keep consistent with the non_canonical layer. Also, I have reverted the change in sc_state_db::tests::make_test_db (previously mentioned in #11980 (comment)) to add test cases (along with syncs_state) for this change.
/cc @arkpar @cheme

@cheme (Contributor)

@NingLin-P, not sure if related, but I tried to run polkadot patched with this PR and pruning seems to loop infinitely ("Trying to prune when there's nothing to prune").

@NingLin-P (Contributor, Author)

Do you have any logs (trace level)? I'd like to take a look.

@cheme (Contributor) commented Aug 18, 2022

https://gist.github.com/cheme/f1ea299959f23d94636160cb00ace674
But actually it does not seem related to your PR; there is something else I need to find (this only happens with parity-db).
Edit: well, master with parity-db works fine, so it should relate to this PR's changes in some way.

@cheme (Contributor) commented Aug 18, 2022

Ok, I am stupid!!! rocksdb does not use the DB backend. So yes, the attached trace possibly shows an issue with this PR.
Got the traces with ./target/release/polkadot -d ./tmp1 --state-pruning 256 --log state-db=trace --log client-db=trace 2&> log
Note that when starting with a window of 900 blocks, I don't see the issue anymore (maybe it uses force_canonicalize?).
Edit: actually it seems more likely we are trying to get a key from the db that is still in the commit_set (I see the insert at some point and a miss 1 millisecond after).

@NingLin-P (Contributor, Author)

actually it seems more likely we are trying to get a key from the db that is still in the commit_set (I see the insert at some point and a miss 1 millisecond after).

Yes, I also found that the cause of this issue is the db returning None for some blocks that had previously been added to the commit_set.

@NingLin-P (Contributor, Author)

I was thinking of fixing this issue by:

  1. Adding an additional cache of recently added blocks, so we can get blocks that are not yet committed to the database from this cache
  2. Or, not counting blocks that are not yet committed to the database towards window_size, so we will not prune blocks that are not yet committed to the database

but both approaches require some cleanup work (clearing the cache or adjusting window_size) in apply_pending/revert_pending, and it seems these functions may be called concurrently, which may mix up the state:

substrate/client/db/src/lib.rs

Lines 1940 to 1949 in bdd1399

match self.try_commit_operation(operation) {
    Ok(_) => {
        self.storage.state_db.apply_pending();
        Ok(())
    },
    e @ Err(_) => {
        self.storage.state_db.revert_pending();
        e
    },
}

For example, with two CommitSets c1 and c2: if apply_pending is called for c2 first, and later revert_pending needs to be called for c1, there is nothing left to revert since the previous apply_pending already applied all the changes (not sure if this problem exists in the master branch).

Adding some counter state to identify which CommitSet apply_pending/revert_pending is being called for may work around this problem, but it is still tricky to recover from cases like some newer blocks committing to the db successfully while some older blocks fail.

@cheme (Contributor)

That's an issue I did not foresee :(
A third possibility would be to just not prune, and prune more once it becomes available in the db.
On prune_one, we prune the range from the last pruned block up to last canonicalized minus the window range, and skip anything not yet available (maybe it requires maintaining a new last_pruned counter in the db, and maybe something else for warp sync).
@arkpar, this issue makes me wonder if there may be some unpruned content, I mean whether currently we may prune over content that is still in a commit set during sync.
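A minimal sketch of that third option (all names here are placeholders, not existing APIs; the PR's eventual "do not prune unavailable block" commit is along these lines, but this is not its code):

```rust
type BlockHash = [u8; 32];

struct DeathRow {
    hash: BlockHash,
    deleted: Vec<Vec<u8>>,
}

/// Placeholder: whether the journal record for `number` has already reached
/// the database (as opposed to still sitting in an uncommitted CommitSet).
fn record_available_in_db(_number: u64) -> bool {
    false
}

/// If the next block's journal record is not in the database yet, skip
/// pruning for now and retry on a later call instead of treating it as an
/// error; otherwise load the record and prune as usual.
fn try_prune_next(next_to_prune: u64) -> Option<DeathRow> {
    if !record_available_in_db(next_to_prune) {
        return None // nothing available yet; the block is pruned later
    }
    // Placeholder: a real implementation would load and return the record here.
    None
}
```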

@arkpar (Member)

For example, with two CommitSets c1 and c2: if apply_pending is called for c2 first, and later revert_pending needs to be called for c1, there is nothing left to revert since the previous apply_pending already applied all the changes (not sure if this problem exists in the master branch).

commit_operation only happens in one thread. There's a higher-level lock in the Client that synchronises block import operations. This low-level API is prone to misuse indeed. I'm considering removing the pending state altogether and simply re-creating the statedb object to recover from errors.

@arkpar, this issue makes me wonder if there may be some unpruned content, I mean whether currently we may prune over content that is still in a commit set during sync.

This should not be happening. Blocks are imported one at a time and we only prune finalized blocks.

@NingLin-P (Contributor, Author)

The issue is fixed, PTAL again.

@cheme (Contributor) left a comment

LGTM, a few questions, but nothing blocking.

I ran a few sync tests and got an error when starting a new chain with "--pruning 0", but I think we can ignore it.

)
.unwrap_or(None)
.is_some(),
_ => false,
@cheme (Contributor)

Here I think it would be better to return a ClientError (have_state_at seems to be only called by functions that return it).

@NingLin-P (Contributor, Author)

Changing the return type here would also require changing the Backend interface, and some callers of it may not expect an error.

@cheme (Contributor)

I vote for changing it (client db is quite internal to substrate I think). But let's wait for a second opinion.


/// The result returned by `RefWindow::have_block`
#[derive(Debug, PartialEq, Eq)]
pub enum HaveBlock {
@cheme (Contributor)

An Option instead of HaveBlock or IsPruned would be less expressive, but would make the code a bit shorter (using map and unwrap_or).
Not sure if that's better, just mentioning that I am not against it.

@NingLin-P (Contributor, Author)

Indeed, it is a hard trade-off to make.

    let new_max_block = self.death_rows.len() as u64 + self.pending_number;
    self.death_index.retain(|_, block| *block < new_max_block);
}
self.queue.revert_recent_add(self.base, self.pending_canonicalizations);
self.pending_canonicalizations = 0;
self.pending_prunings = 0;
@cheme (Contributor)

@arkpar, not exactly sure why we reset pending_prunings here (it was in the original code).

@NingLin-P (Contributor, Author) commented Aug 29, 2022

I thought the reason might be that the CommitSet containing the data collected in prune_one failed to commit to the db; resetting pending_prunings would allow us to re-collect the data in prune_one and retry.

@cheme added labels on Aug 26, 2022: A0-please_review (Pull request needs code review), B0-silent (Changes should not be mentioned in any release notes), C1-low (PR touches the given topic and has a low impact on builders), D2-notlive 💤 (PR contains changes in a runtime directory that is not deployed to a chain that requires an audit).
NingLin-P and others added 3 commits August 29, 2022 17:32
Signed-off-by: linning <[email protected]>

Co-authored-by: cheme <[email protected]>
Signed-off-by: linning <[email protected]>

Co-authored-by: cheme <[email protected]>
Signed-off-by: linning <[email protected]>
@arkpar (Member) commented Sep 1, 2022

bot merge

@paritytech-processbot: Waiting for commit status.

@rvalle commented Dec 2, 2022

@NingLin-P and @arkpar I am testing this. I can see the effect on memory, but I still cannot query more than 4096 blocks; I have filed paritytech/polkadot#6378 while trying to understand what is going on.

ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023
* introduce DbBackedQueue for the state pruning window

Signed-off-by: linning <[email protected]>

* avoid cloning for next_hash

Signed-off-by: linning <[email protected]>

* add tests

Signed-off-by: linning <[email protected]>

* make clippy happy

Signed-off-by: linning <[email protected]>

* impl have_block by checking block number

Signed-off-by: linning <[email protected]>

* refactor

Signed-off-by: linning <[email protected]>

* fix tests & add test for init db-backed queue

Signed-off-by: linning <[email protected]>

* update comment

Signed-off-by: linning <[email protected]>

* add check for have_state_at

Signed-off-by: linning <[email protected]>

* address comment

Signed-off-by: linning <[email protected]>

* renanme unload_blocks to uncached_blocks

Signed-off-by: linning <[email protected]>

* address comment

Signed-off-by: linning <[email protected]>

* fix syncs_state test

Signed-off-by: linning <[email protected]>

* address comment

Signed-off-by: linning <[email protected]>

* revert change to make_test_db to add test cases

Signed-off-by: linning <[email protected]>

* do not prune unavailable block & add tests

Signed-off-by: linning <[email protected]>

* Update client/state-db/src/lib.rs

Signed-off-by: linning <[email protected]>

Co-authored-by: cheme <[email protected]>

* Update client/state-db/src/pruning.rs

Signed-off-by: linning <[email protected]>

Co-authored-by: cheme <[email protected]>

* address comment

Signed-off-by: linning <[email protected]>

Signed-off-by: linning <[email protected]>
Co-authored-by: cheme <[email protected]>