statedb: allow longer state pruning history #11980
Conversation
User @NingLin-P, please sign the CLA here.
PTAL @arkpar
Thank you for the PR! Consider adding your polkadot address in the format described here. For now I only took a basic look and it looks mostly good. There's just one issue: I don't think there's any real need to maintain … We can always get the next hash to prune from the database by loading the next journal record. For the …
Indeed, I have also checked the upper callers that required … All the blocks in …
But the block in question may not be canonicalized (like the parameter passed to …). The block passed to … So I suppose we'd need to document that. For additional safety I'd add a number argument to …
@arkpar Thanks for the explanation! I have changed the implementation, but two tests of …
PTAL again @arkpar
After some consideration I can see that is_pruned/pin is still a bit problematic, considering that the caller indeed does not know when the hash is taken into account or when the number is. I think we can fix this with the following:
A call to …
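The proposed fix above is cut off, so the exact API change isn't visible. Purely as an illustration of the idea of passing a block number alongside the hash, a sketch (hypothetical free function, not the actual state-db API) could look like this:

```rust
/// Hypothetical sketch: decide whether a block is pruned by first comparing
/// its number against the pruning window `[base, base + window_len)`, and only
/// consulting the journaled hash when the number falls inside that window.
fn is_pruned(base: u64, window_len: u64, number: u64, hash_matches_journal: bool) -> bool {
    if number < base {
        true // below the window: already pruned
    } else if number >= base + window_len {
        false // above the window: not canonicalized yet, so not pruned
    } else {
        // inside the window: the hash must match the journal record for that number
        !hash_matches_journal
    }
}

fn main() {
    assert!(is_pruned(100, 256, 10, false)); // far below the window
    assert!(!is_pruned(100, 256, 150, true)); // inside the window and journaled
}
```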
Comments addressed
Perhaps we can handle the … Since the …
I was imagining that …
The difference is how much data needs to be queried from the DB. A single root trie node is small, but the journal record may contain a lot of keys. An alternative storage scheme would be more efficient, but I don't think it is worth the effort to change it now.
Yes, FIFO was the wrong wording from me; what I meant was that in your code you give priority to the first-out block over the first-in one; the opposite could be done. The fact that we force a consecutive range means we are emptying the back of the VecDeque, batch-refilling it, emptying it again, and so on, so one idea would be to just keep the index of the block and not force a consecutive range (avoiding the batch of queries this way). But that only matters when the pruning range doesn't fit in memory.
Oh, the code emphasizes "consecutive" a lot because blocks are pruned in consecutive order. The pruning window is a FIFO queue: a newly canonicalized block is put at the back of the queue, and when the queue's size exceeds the configured size, the block at the head is popped out and pruned until the size is <= the configured size again. Since the blocks in the pruning window are only used for pruning, we only load a block into the cache when it is about to be pruned.
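To illustrate the FIFO behaviour described above, here is a minimal sketch (a toy type, not the actual RefWindow implementation) of a window that pops and prunes the oldest block once the configured size is exceeded:

```rust
use std::collections::VecDeque;

/// Toy model of the pruning window described above: canonicalized blocks
/// enter at the back, and once the window grows past `max_size` the oldest
/// block is popped from the front to be pruned.
struct PruningWindow {
    max_size: usize,
    queue: VecDeque<u64>, // block numbers, oldest at the front
}

impl PruningWindow {
    fn new(max_size: usize) -> Self {
        Self { max_size, queue: VecDeque::new() }
    }

    /// Push a newly canonicalized block and return the blocks that fall out
    /// of the window and should now be pruned.
    fn canonicalize(&mut self, number: u64) -> Vec<u64> {
        self.queue.push_back(number);
        let mut to_prune = Vec::new();
        while self.queue.len() > self.max_size {
            to_prune.push(self.queue.pop_front().expect("queue is non-empty; qed"));
        }
        to_prune
    }
}

fn main() {
    let mut window = PruningWindow::new(3);
    for n in 0..6u64 {
        println!("canonicalized {n}, pruning {:?}", window.canonicalize(n));
    }
}
```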
I think I looked in the wrong direction; what matters is to have one for the case where the window is at the default size (I believe it is 256, from substrate/client/state-db/src/lib.rs line 269 in a1a9b47).
I was looking at what happens when we get a bigger window than the cache, and in that case just not having the active caching is also fine.
client/state-db/src/pruning.rs
const PRUNING_JOURNAL: &[u8] = b"pruning_journal";
// Default pruning window size plus a magic number to keep the most common ops in cache
const CACHE_BATCH_SIZE: usize = 256 + 10;
We could expose 256 as another constant.
Also this size could be calculated from the actual constraint value in state-db/src/lib.rs and not be a constant.
Also, if the state-db does have a constraint > 256 (--pruning-size param), we could then just disable the cache and always fetch from the db on pop_front.
This is an alternative; it just saves some cache memory and avoids the batch download.
> We could expose 256 as another constant.
> Also this size could be calculated from the actual constraint value in state-db/src/lib.rs and not be a constant.

Yes, we can calculate this cache size from the default window size, but overall we still need a constant to determine the max cache size.

> Also, if the state-db does have a constraint > 256 (--pruning-size param), we could just disable the cache and always fetch from the db on pop_front.
> This is an alternative; it just saves some cache memory and avoids the batch download.
Having a cache will be helpful. Since all blocks eventually need to be pruned (and thus need to be loaded from the db if the window size is larger than the max cache size), batch loading doesn't bring more cost.
And the cache may be more friendly to the underlying caching (i.e. the page cache), as cache items are small (block hash + deleted keys, which are also hashes I think), so multiple items can fit in the same page. Also, having a cache can avoid re-loading a block in case of revert_pending, and it makes the code cleaner as we don't need to add the constraint > 256 condition around.
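As a small illustration of the suggestion, the cache size could be derived from an exposed default-window constant instead of a hard-coded 256 (the constant names below are hypothetical):

```rust
/// Assumed default pruning window size (the 256 discussed above).
const DEFAULT_PRUNING_WINDOW: usize = 256;
/// The "magic number" of extra slots from the reviewed snippet.
const CACHE_HEADROOM: usize = 10;
/// Cache size derived from the default window rather than a literal 256.
const CACHE_BATCH_SIZE: usize = DEFAULT_PRUNING_WINDOW + CACHE_HEADROOM;

fn main() {
    println!("cache batch size: {}", CACHE_BATCH_SIZE);
}
```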
> Having a cache will be helpful. Since all blocks eventually need to be pruned (and thus need to be loaded from the db if the window size is larger than the max cache size), batch loading doesn't bring more cost.

The point of not having it is mainly to reduce code (maybe the query of 256 items would not be visible indeed).
As mentioned before, my opinion (which may not be correct) on the cache for pruning windows > cache size would be to use a sparse array (so not all items have to be loaded/read from the db). But it is really not a concern, since the first goal of this cache seems to be to maintain similar perf when using a reasonable pruning window.
I think this is my last batch of comments 👍 (I give up on trying to change your mind on the queue design)
// NOTE: `pop_front` should not return `MetaDb::Error` because blocks are visited
// by `RefWindow::prune_one` first then `RefWindow::apply_pending` and
// `DeathRowQueue::get` should load the blocks into cache already
.expect("block must loaded in cache thus no MetaDb::Error")
The NOTE does not make much sense to me; for me we just do not expect to have a missing journal, or the queue (queue and uncached) to be empty, if pending_pruning is 0 (and pending_pruning being out of the queue, but that is the next expectation).
`pop_front` returns `Result<Option<DeathRow<BlockHash, Key>>, Error<D::Error>>`. The first `expect` is for the potential loading from the db, which brings a potential `Error<D::Error>`; this is unexpected because there should be no db loading for `pop_front` in the current usage, as the comment explains. And the second `expect` is for the case you mentioned.
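A tiny standalone illustration of that double-expect pattern on a `Result<Option<_>, _>` (toy types, not the actual DeathRow/MetaDb code):

```rust
use std::collections::VecDeque;

/// Toy stand-in for `pop_front`: the outer `Result` models a possible db
/// error (never produced here because the item is already cached), the inner
/// `Option` models an empty queue.
fn pop_front(cache: &mut VecDeque<u32>) -> Result<Option<u32>, String> {
    Ok(cache.pop_front())
}

fn main() {
    let mut cache: VecDeque<u32> = vec![1, 2, 3].into();
    let row = pop_front(&mut cache)
        .expect("block must be loaded in cache, thus no MetaDb::Error")
        .expect("queue is not empty");
    println!("pruning {row}");
}
```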
if self.base == 0 && self.queue.len() == 0 && number > 0 {
	// assume that parent was canonicalized
	self.base = number;
} else if (self.base + self.queue.len() as u64) != number {
	return Err(Error::StateDb(StateDbError::InvalidBlockNumber))
}
This change follows from:

substrate/client/state-db/src/noncanonical.rs, lines 251 to 258 in 055453e:

if self.levels.is_empty() && self.last_canonicalized.is_none() && number > 0 {
	// assume that parent was canonicalized
	let last_canonicalized = (parent_hash.clone(), number - 1);
	commit
		.meta
		.inserted
		.push((to_meta_key(LAST_CANONICAL, &()), last_canonicalized.encode()));
	self.last_canonicalized = Some(last_canonicalized);
namely that the first block put into the pruning window may not be block 0. So in the previous implementation the block number in the pruning window was incorrect, but that was okay because its public functions didn't take any block number as a parameter (perhaps this is why the previous `have_block` needed to brute-force search all blocks).
But this PR optimizes `have_block` by using the block number to index the queried block, which caused the `syncs_state` test to fail, so we need to add this change to stay consistent with the `non_canonical` layer. Also, I have reverted the change in `sc_state_db::tests::make_test_db` (previously mentioned in #11980 (comment)) to add test cases (along with `syncs_state`) for this change.
/cc @arkpar @cheme
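For readers following along, here is a rough sketch of the number-indexed `have_block` check described above (the enum variants and signature are hypothetical, not the PR's exact code):

```rust
/// Hypothetical result type; the PR's actual `HaveBlock` enum may differ.
#[derive(Debug, PartialEq, Eq)]
enum HaveBlock {
    No,
    Maybe, // the number is inside the window, the hash still needs checking
}

/// Sketch of a number-indexed membership check: the pruning window holds the
/// consecutive block numbers `[base, base + queue_len)`, so a single range
/// comparison replaces a brute-force search over all journaled blocks.
fn have_block(base: u64, queue_len: u64, number: u64) -> HaveBlock {
    if number >= base && number < base + queue_len {
        HaveBlock::Maybe
    } else {
        HaveBlock::No
    }
}

fn main() {
    assert_eq!(have_block(100, 256, 99), HaveBlock::No);
    assert_eq!(have_block(100, 256, 200), HaveBlock::Maybe);
}
```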
@NingLin-P, not sure if related, but I tried to run polkadot patched with this PR and `prune` seems to loop infinitely ("Trying to prune when there's nothing to prune").
Do you have any logs (`trace` level)? I'd like to have a check.
https://gist.github.com/cheme/f1ea299959f23d94636160cb00ace674
But actually it does not seem related to your PR; there is something else I need to find (this only happens with parity-db).
Edit: well, master with parity-db works fine, so it should relate to this PR's changes in some way.
Ok, I am stupid!!! rocksdb does not use the DB backend. So yes, the attached trace possibly shows an issue with this PR.
Got it with ./target/release/polkadot -d ./tmp1 --state-pruning 256 --log state-db=trace --log client-db=trace 2&> log
Note that starting with a window of 900 blocks, I don't see the issue anymore (maybe using `force_canonicalize`?).
Edit: actually it seems more likely we are trying to get a key from the db that is still in a commit_set (I see the insert at some point and a miss 1 millisecond after).
> actually it seems more likely we are trying to get a key from the db that is still in a commit_set (I see the insert at some point and a miss 1 millisecond after).

Yes, I also found that the cause of this issue is the db returning None for some blocks that had previously been added to the `commit_set`.
I was thinking of fixing this issue by:

- Adding an additional cache of recently added blocks, so we can get blocks that are not yet committed to the database from this cache
- Or not counting blocks that are not yet committed to the database towards `window_size`, so we will not prune blocks that are not yet committed to the database

But both approaches require some cleanup work (clearing the cache or adjusting `window_size`) in `apply_pending`/`revert_pending`, and it seems these functions may be called concurrently, which may mix up the state:
substrate/client/db/src/lib.rs, lines 1940 to 1949 in bdd1399:

match self.try_commit_operation(operation) {
	Ok(_) => {
		self.storage.state_db.apply_pending();
		Ok(())
	},
	e @ Err(_) => {
		self.storage.state_db.revert_pending();
		e
	},
}
For example, take two `CommitSet`s `c1` and `c2`: if `apply_pending` is called for `c2` first and `revert_pending` later needs to be called for `c1`, there is nothing left to revert, since the previous `apply_pending` already applied all the changes (not sure if this problem exists on the master branch).
Adding some counter state to identify which `CommitSet` the `apply_pending`/`revert_pending` call is for may work around this problem, but it is still tricky to recover from cases like some newer blocks committing to the db successfully while some older blocks fail.
That's an issue I did not foresee :(
A third possibility would be to just not prune, and prune more later once the data is available in the db. On `prune_one`, we prune from last pruned to last canonicalized - range and skip what is not available (maybe it requires maintaining a new last_pruned counter in the db, and maybe something else for warp sync).
@arkpar, this issue makes me wonder if there may be some unpruned content, I mean whether currently we may prune over content that is still in a commit set during sync.
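A rough sketch of that third possibility, assuming a hypothetical last_pruned counter and helper callbacks (none of this is the PR's actual code):

```rust
/// Sketch of "prune only what is already in the db": walk forward from the
/// last pruned block, stop as soon as a journal record has not been committed
/// yet, and retry on a later pruning pass.
fn prune_available(
    last_pruned: &mut u64,
    last_canonicalized: u64,
    window: u64,
    load_journal: impl Fn(u64) -> Option<Vec<u8>>, // assumed meta-db lookup
    mut prune_record: impl FnMut(Vec<u8>),         // assumed deletion of the journaled keys
) {
    while *last_pruned + window < last_canonicalized {
        let next = *last_pruned + 1;
        match load_journal(next) {
            Some(record) => {
                prune_record(record);
                *last_pruned = next;
            }
            // The record is still only in an uncommitted CommitSet: skip for now.
            None => break,
        }
    }
}

fn main() {
    let mut last_pruned = 0;
    // Pretend only journal records up to block 5 have been committed so far.
    prune_available(
        &mut last_pruned,
        300,
        10,
        |n| if n <= 5 { Some(vec![n as u8]) } else { None },
        |record| println!("pruning journal record {record:?}"),
    );
    assert_eq!(last_pruned, 5);
}
```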
> For example, take two CommitSets c1 and c2: if apply_pending is called for c2 first and revert_pending later needs to be called for c1, there is nothing left to revert, since the previous apply_pending already applied all the changes (not sure if this problem exists on the master branch).

`commit_operation` only happens in one thread. There's a higher-level lock in the `Client` that synchronises block import operations. This low-level API is prone to misuse indeed. I'm considering removing the pending state altogether and simply re-creating the statedb object to recover from errors.

> @arkpar, this issue makes me wonder if there may be some unpruned content, I mean whether currently we may prune over content that is still in a commit set during sync.

This should not be happening. Blocks are imported one at a time, and we only prune finalized blocks.
The issue is fixed, PTAL again.
LGTM, a few questions, but nothing blocking.
I ran a few sync tests and got an error when starting a new chain with "--pruning 0", but I think we can ignore it.
)
.unwrap_or(None)
.is_some(),
_ => false,
Here I think it would be better to return a ClientError (have_state_at seems to be only called by functions that return it).
Changing the return type here will also require changing the `Backend` interface; some callers of it may not expect an error.
I vote for changing it (client db is quite internal to substrate I think). But let's wait for a second opinion.
/// The result returned by `RefWindow::have_block`
#[derive(Debug, PartialEq, Eq)]
pub enum HaveBlock {
An Option instead of `HaveBlock` or `IsPruned` would be less expressive, but would make the code a bit shorter (using map and unwrap_or).
Not sure if it's better, just mentioning that I am not against it.
Indeed, it is a hard trade-off to make.
	let new_max_block = self.death_rows.len() as u64 + self.pending_number;
	self.death_index.retain(|_, block| *block < new_max_block);
}
self.queue.revert_recent_add(self.base, self.pending_canonicalizations);
self.pending_canonicalizations = 0;
self.pending_prunings = 0;
@arkpar not exactly sure why we change pending_pruning here (was in the original code).
I thought the reason may be that the `CommitSet` that has the data collected in `prune_one` fails to commit to the db; resetting `pending_prunings` could allow us to re-collect the data in `prune_one` and retry.
bot merge
Waiting for commit status.
@NingLin-P and @arkpar I am testing this. I can see the effect in memory, but I still cannot query more than 4096 blocks; I have filed paritytech/polkadot#6378 while trying to understand what is going on.
* introduce DbBackedQueue for the state pruning window
* avoid cloning for next_hash
* add tests
* make clippy happy
* impl have_block by checking block number
* refactor
* fix tests & add test for init db-backed queue
* update comment
* add check for have_state_at
* address comment
* rename unload_blocks to uncached_blocks
* address comment
* fix syncs_state test
* address comment
* revert change to make_test_db to add test cases
* do not prune unavailable block & add tests
* Update client/state-db/src/lib.rs
* Update client/state-db/src/pruning.rs
* address comment

Signed-off-by: linning <[email protected]>
Co-authored-by: cheme <[email protected]>
Fixes #11911
This PR introduces a database-backed queue for the state pruning window, which only keeps a few blocks in memory and loads more blocks on demand, thus reducing the burden of setting a long state pruning history.
Note that the queue requires the backend database to support reference counting (i.e. parity-db); databases that do not support reference counting (i.e. rocksdb) still need to keep all blocks in memory.

polkadot address: 1CWyso9Zw46QdR54sTvKJXTUdmZznYZa2aJfNHdG6xQ6L71
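For a rough idea of the approach, here is a minimal sketch of a db-backed pruning queue with a bounded in-memory cache; the type and field names are hypothetical and simplified compared to the actual implementation:

```rust
use std::collections::VecDeque;

/// Simplified sketch (not the actual `DbBackedQueue` in this PR) of a pruning
/// queue that keeps only a bounded number of journal records in memory and
/// batch-loads the rest from the database on demand.
struct DbBackedQueue<Record> {
    base: u64,               // number of the first block still in the window
    len: u64,                // total blocks in the window, cached or not
    cache: VecDeque<Record>, // front of the window, held in memory
    cache_capacity: usize,
    load: Box<dyn Fn(u64) -> Option<Record>>, // assumed journal-record lookup
}

impl<Record> DbBackedQueue<Record> {
    /// Pop the next block to prune, refilling the cache from the db if empty.
    fn pop_front(&mut self) -> Option<Record> {
        if self.len == 0 {
            return None;
        }
        if self.cache.is_empty() {
            // Batch-load the next chunk of journal records from the db.
            let end = (self.base + self.cache_capacity as u64).min(self.base + self.len);
            for number in self.base..end {
                match (self.load)(number) {
                    Some(record) => self.cache.push_back(record),
                    None => break, // record not committed yet, stop here
                }
            }
        }
        let record = self.cache.pop_front()?;
        self.base += 1;
        self.len -= 1;
        Some(record)
    }
}

fn main() {
    let mut queue: DbBackedQueue<String> = DbBackedQueue {
        base: 0,
        len: 1000,
        cache: VecDeque::new(),
        cache_capacity: 256,
        load: Box::new(|number| Some(format!("journal record for block {number}"))),
    };
    // Only up to `cache_capacity` records are ever held in memory at once.
    println!("{:?}", queue.pop_front());
}
```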