Trillian batch operations with > 1k entries fail on MariaDB >= 10.3 #1845

Closed
mhutchinson opened this issue Sep 18, 2019 · 21 comments

@mhutchinson
Contributor

go test ./testonly/hammer/...
--- FAIL: TestInProcessMapHammer (9.77s)
hammer_test.go:129: hammer failure: failed to verify inclusion proof for Index=922cab9b7de0da6c6b65792d3030303030303234000000000000000000000000: calculated root: 3af78a4fc25a48d822fd41eed4d56b3afff53cf6ac4f603427120b9127eece9c, want: 8192c68b3ec424a648ef33ae4d3b970b7ff870f12aa6fca03214652d25363e61
FAIL
FAIL github.com/google/trillian/testonly/hammer 10.203s
? github.com/google/trillian/testonly/hammer/maphammer [no test files]
? github.com/google/trillian/testonly/hammer/mapreplay [no test files]
FAIL

Whatever else happens, we need to be explicit about which versions of MySQL we support.

It's also worth seeing whether this version of MySQL can be supported without breaking backwards compatibility.

@gdbelvin
Contributor

Occasionally passing for me

go test ./testonly/hammer/
ok  	github.com/google/trillian/testonly/hammer	9.462s

mysql -V
mysql  Ver 8.0.17 for osx10.14 on x86_64 (Homebrew)

go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/gbelvin/Library/Caches/go-build"
GOENV="/Users/gbelvin/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/gbelvin/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.13/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.13/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/gbelvin/go/src/github.com/google/trillian/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/xq/1dznj60166q1qlzqry5c9c6h006zkw/T/go-build226338431=/tmp/go-build -gno-record-gcc-switches -fno-common"

mhutchinson added a commit that referenced this issue Sep 30, 2019
This ensures that issues such as #1845 can't be introduced and go unnoticed.
@mhutchinson
Contributor Author

This works reliably for me with a min/max leaf batch of 100, but 150 fails reliably (admittedly on a small sample size):

HAMMER_OPTS="--operations=500 --min_leaves=150 --max_leaves=150 --invalid_chance=0" ./integration/maphammer.sh

@gdbelvin
Contributor

gdbelvin commented Oct 2, 2019

Same error around the wrong roots?

@mhutchinson
Contributor Author

Still fails to validate inclusion proofs. I'm looking into it now.

mhutchinson added a commit to mhutchinson/trillian that referenced this issue Oct 22, 2019
There is no reason in the real world that the number of leaves fetched by clients in a batch would have any relationship to the number of leaves written at a time. The single parameter was an artefact of the way the hammer was written. Allowing these to be configured separately better approximates the real world and allows the hammer to hit the map more flexibly.

Making either of the read or write flags be above 100 triggers the issue in google#1845, though it does so differently for each of the batch sizes.
mhutchinson added a commit that referenced this issue Oct 22, 2019
There is no reason in the real world that the number of leaves fetched by clients in a batch would have any relationship to the number of leaves written at a time. The single parameter was an artefact of the way the hammer was written. Allowing these to be configured separately better approximates the real world and allows the hammer to hit the map more flexibly.

Making either of the read or write flags be above 100 triggers the issue in #1845, though it does so differently for each of the batch sizes.
@mhutchinson
Contributor Author

Confirmed that batches of > 100 work on MariaDB 10.1 and 10.2, but fail on 10.3 and 10.4.

@mhutchinson changed the title from "./testonly/hammer_test.go fails on mariadb 10.3.12" to "Map read/write batches > 100 can fail on MariaDB >= 10.3" on Oct 23, 2019
mhutchinson added a commit to mhutchinson/trillian that referenced this issue Oct 23, 2019
This doesn't trigger google#1845 but that in itself tells us something.
mhutchinson added a commit to mhutchinson/trillian that referenced this issue Oct 23, 2019
This doesn't trigger google#1845 but that in itself tells us something.
mhutchinson added a commit to mhutchinson/trillian that referenced this issue Oct 23, 2019
This allows for useful diagnostics when digging into issues such as google#1845, provided the correct command-line args are passed, e.g.:

`go test -v ./storage/mysql/... -count 1 -args -stderrthreshold=INFO`
@gdbelvin
Contributor

Thanks for doing all this research. Nailing this down to particular DB versions is a big step forward. Props.

@mhutchinson
Contributor Author

mhutchinson commented Oct 24, 2019

This may be useful to other people needing to switch between DB versions to isolate problems. So that I'm not the only one with the dubious honour of knowing how to do this, here are the Homebrew steps I use. Note that in all cases I don't care about the DB contents; these steps should not be followed if you want a data-preserving upgrade/downgrade procedure:

brew services list # shows that, e.g. 10.2 is running
brew services stop [email protected]
rm -rf /usr/local/var/mysql
brew install [email protected] # or brew reinstall [email protected] if you've been here before
brew services start [email protected]
mysql -u root -e 'SELECT @@GLOBAL.version;' # confirm that you're running the server you think

Edit: Note that MariaDB 10.4 changes the security options quite a bit, so when using this version you need to do the following to get the previous behaviour of a passwordless root login (I'm not suggesting this is a good idea, just pointing out that it's required to maintain the old behaviour):

sudo mysql -e 'ALTER USER root@localhost IDENTIFIED VIA mysql_native_password USING PASSWORD("");'

@mhutchinson
Contributor Author

Here's a pretty key finding: this works on the newer versions of MariaDB if --large_preload_fix=false, but not if --large_preload_fix=true. Following the chain of code down, and given the observation that revisions 0 and 1 are always calculated correctly, I'm going to go out on a limb and suggest that the queries downstream of the MySQL impl of GetMerkleNodes are broken for newer MariaDB versions when the batch size of the nodes is large. It's likely that skipping the preload means that many smaller fetches are performed instead of one giant batch, but this needs confirming.

@pavelkalinnikov how likely is this preload to be changed by your ongoing work? I feel that I'm getting closer to root causing this now, but if it's all about to be thrown away then I'll save my time for something more useful.

@pav-kv
Contributor

pav-kv commented Oct 24, 2019

Wow, this is a great finding! My work largely overlaps with this option indeed:

  1. PR #1902 (Replace legacy sparse Merkle tree writer) reimplements it entirely. However, I believe at this point the underlying queries and the exact nodes/tiles that are fetched don't change.
  2. Follow-up PRs will drop the SubtreeCache machinery sitting in between the Merkle tree code and Get/SetMerkleNodes. The cache can do sneaky reads while writing stuff, and I actually believe that the highest chance of a bug is in there.
  3. I have an action item to look at removing this option, i.e. --large_preload_fix=false by default. My current belief is that after my refactoring the large preload won't be needed.

It definitely makes sense to look at why it doesn't work now to then demonstrate that the issue was fixed and how. I am happy to play with it, and see if my interference is needed there.

Also, I have a totally made-up idea on why this might be failing. Could be a combination of the following:

  1. A ±1 error in the revision number that the SubtreeCache/SMT code fetches/writes.
  2. A thing that changes between MariaDB versions has something to do with how it reads "dirty" rows, e.g. one version could read from a snapshot while another reads the "dirty" data. Maybe the default isolation level is different, or something like that.
  3. A race or other condition causing multiple reads/writes of the same tile.

Not sure what's so magical about this exact hundred threshold...

@pav-kv
Contributor

pav-kv commented Oct 24, 2019

Check this thing out introduced in MariaDB 10.3: https://mariadb.com/kb/en/library/conversion-of-big-in-predicates-into-subqueries/. Note the condition enabling this optimization: "the IN list has more than 1000 elements".

Our subtrees query uses a large IN statement, which means it might be impacted by this change.
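For illustration, the problematic pattern is roughly the following (a simplified sketch, not the actual storage code; the function, table and column names are assumptions, and the usual context/database/sql/fmt/strings imports apply):

```go
// Sketch of a subtree read that puts one "?" placeholder per subtree ID into
// an IN-list. With ~100-leaf write batches and ~10 strata per path,
// len(subtreeIDs) can exceed MariaDB 10.3's 1000-element threshold, at which
// point the server rewrites the IN-list into a subquery.
func selectSubtrees(ctx context.Context, tx *sql.Tx, treeID int64, subtreeIDs [][]byte) (*sql.Rows, error) {
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(subtreeIDs)), ",")
	query := fmt.Sprintf(
		"SELECT SubtreeId, Nodes FROM Subtree WHERE TreeId = ? AND SubtreeId IN (%s)",
		placeholders)
	args := make([]interface{}, 0, len(subtreeIDs)+1)
	args = append(args, treeID)
	for _, id := range subtreeIDs {
		args = append(args, id)
	}
	return tx.QueryContext(ctx, query, args...)
}
```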

The map is split into 11 strata levels:

var defaultMapStrata = []int{8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 176}

The topmost stratum is queried separately, which means the paths in the bottom "shards" span exactly 10 strata. Now 100 random leaf updates probably span a little under 100 shards (birthday paradox might subtract 1 or 2 from that). Multiply that by 10 strata, and maybe do +1 for the root stratum, and we get just above the magical 1000 threshold.

Printing len(nodeIDs) here could help confirm this idea:

for _, nodeID := range nodeIDs {
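A one-line diagnostic sketch, assuming glog is already imported in that file (Trillian's storage code uses it), placed just before the loop above:

```go
// Log the batch size so we can see when it crosses the ~1000 IN-list threshold.
glog.Infof("fetching %d node IDs in one batch", len(nodeIDs))
```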

Also, changing the strata layout should result in a different "magic threshold", e.g. for

var defaultMapStrata = []int{8, 8, 8, 8, 8, 8, 8, 8, 8, 184}

it should be ~112.
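As a rough sanity check of that arithmetic, a small standalone sketch (assuming each leaf path touches one distinct subtree per stratum below the separately queried top stratum, plus one top-stratum read) lands in the same ballpark:

```go
package main

import "fmt"

func main() {
	const inListThreshold = 1000 // MariaDB 10.3's big-IN conversion threshold.
	// 10 strata below the top for {8x10, 176}, 9 for {8x9, 184}.
	for _, strataBelowTop := range []int{10, 9} {
		batch := (inListThreshold - 1) / strataBelowTop
		fmt.Printf("%d strata below top: batches of roughly >%d leaves cross the threshold\n",
			strataBelowTop, batch)
	}
}
```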

@pav-kv
Contributor

pav-kv commented Oct 24, 2019

Confirmed.

mhutchinson added a commit that referenced this issue Oct 25, 2019
This allows for useful diagnostics when digging into issues such as #1845, provided the correct command-line args are passed, e.g.:

`go test -v ./storage/mysql/... -count 1 -args -stderrthreshold=INFO`
mhutchinson added a commit to mhutchinson/trillian that referenced this issue Oct 25, 2019
These were intended to show that google#1845 could be easily reproduced with batches over 1k but they run fine. More investigation is needed, but more tests are always good.
@mhutchinson
Contributor Author

Failing test case! #1930

mhutchinson added a commit to mhutchinson/trillian that referenced this issue Oct 25, 2019
This will flush out issues like google#1845 faster and more precisely than relying on the hammer.
mhutchinson added a commit that referenced this issue Oct 25, 2019
This will flush out issues like #1845 faster and more precisely than relying on the hammer.
mhutchinson added a commit to mhutchinson/trillian that referenced this issue Oct 25, 2019
These were intended to show that google#1845 could be easily reproduced with batches over 1k but they run fine. More investigation is needed, but more tests are always good.
mhutchinson added a commit to mhutchinson/trillian that referenced this issue Oct 28, 2019
These were intended to show that google#1845 could be easily reproduced with batches over 1k but they run fine. More investigation is needed, but more tests are always good.
@mhutchinson changed the title from "Map read/write batches > 100 can fail on MariaDB >= 10.3" to "Trillian batch operations with > 1k entries fail on MariaDB >= 10.3" on Oct 28, 2019
@mhutchinson
Contributor Author

Updated the title: as the tests in #1930 demonstrate, this affects the log API too. The log storage layer can be coerced into failing in multiple ways through batch operations, but only one of these is wired to the gRPC API: QueueLeaves (when >= 1k duplicate entries are queued).

The root cause appears to be https://jira.mariadb.org/browse/MDEV-12176, which is still attracting discussion from the community. It's unclear why this optimization results in a functional change for all of our lookups. There is one report of behavioural changes, allegedly fixed in 10.3.3, but all ongoing discussions are around the performance impact.

Getting further into this may be a waste of our resources. As a strawman, I suggest we remove MariaDB 10.3+ from Trillian's supported configurations, unless we believe there is sufficient motivation to continue supporting it. Any fix will require updating the SQL, which may impact performance/correctness in other situations.

mhutchinson added a commit to mhutchinson/trillian that referenced this issue Oct 29, 2019
These show that google#1845 can be reproduced with batches over 1k on newer
versions of MariaDB. Some of these tests are map specific, but the
QueueLeaves test can be triggered via the Trillian gRPC API.
mhutchinson added a commit that referenced this issue Oct 30, 2019
These show that #1845 can be reproduced with batches over 1k on newer
versions of MariaDB. Some of these tests are map specific, but the
QueueLeaves test can be triggered via the Trillian gRPC API.

Failing tests are all marked to be skipped until #1845 is resolved.
mhutchinson added a commit to mhutchinson/trillian that referenced this issue Nov 11, 2019
These tests would likely fail if they were run for more operations. We have found that batch sizes of < 100 are reliable, so we should stick to that for stability and clarity.
@pgporada
Contributor

pgporada commented Jan 31, 2020

According to a MariaDB developer comment,

in_predicate_conversion_threshold now works on release builds too, this was fixed in https://jira.mariadb.org/browse/MDEV-16871. Should be available with 10.3.18 onwards.
You can set in_predicate_conversion_threshold to 0 to disable this optimization, this would be a temporary workaround.

Unfortunately for us, MariaDB 10.3.18 isn't available in Amazon RDS yet. 🤕
I've sent a support request off to Amazon asking for a newer stable version of MariaDB 10.3.x to be added.

@pav-kv
Contributor

pav-kv commented Jan 22, 2021

@pgporada Have you migrated to a newer MariaDB?

@pav-kv
Contributor

pav-kv commented Jan 22, 2021

@mhutchinson Are there any more action items here?

@mhutchinson
Contributor Author

This was only ever known to cause issues with the DB-backed Map, which was experimental and is now EOL (#2284). So I'm not planning more work here. Any map work will be on the experimental/batchmap from now on.

@pgporada
Contributor

pgporada commented Feb 11, 2021

We did upgrade to MariaDB 10.3.18 prior to me leaving for paternity leave (October 2020). RDS didn't allow us to change the optimizer settings that we needed, and per a support email they had no intention of ever letting us do so. It would take an application change to allow setting in_predicate_conversion_threshold on a per-query basis. I could figure out how to implement that.
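For reference, a minimal sketch of what that application change might look like, assuming a *sql.DB handle to MariaDB >= 10.3; the pinned connection matters because database/sql pools connections, so a bare SET SESSION wouldn't reliably apply to the next query:

```go
// Sketch (not Trillian code): pin one connection, disable the big-IN
// conversion for that session, then run the large IN-list query on it.
conn, err := db.Conn(ctx)
if err != nil {
	return err
}
defer conn.Close()
if _, err := conn.ExecContext(ctx,
	"SET SESSION in_predicate_conversion_threshold = 0"); err != nil {
	return err
}
// ... issue the batched query via conn.QueryContext(ctx, ...) ...
```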

@pphaneuf
Contributor

@pgporada: Is this issue affecting your (non-map) deployment?

@pav-kv
Contributor

pav-kv commented Feb 12, 2021

This issue is activated when a query has an IN-clause with >= 1k entries. It was discovered in map tests, and was failing in tree storage on fetching many tiles [query]. Logs don't fetch that many tiles under any circumstances; it's always O(log2(tree_size)).

However, I can see some queries in logs that do have a variable-length IN-clause:

  1. GetLeavesByIndex with many indices in the request. CTFE doesn't use it, instead it uses a better GetLeavesByRange.
  2. GetLeavesByHash with many hashes in the request. CTFE doesn't use this RPC.
  3. getLeafDataByIdentityHash. This one is activated by the QueueLeaves RPC. CTFE doesn't use it with more than 1 leaf, so it is not affected.

I believe CT logs are not affected by this bug. As for non-CT uses of the log, we could add a limit on the number of entries to submit/query, for all of the 3 cases above.
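Purely as an illustration of such a limit (the constant and helper below are not existing Trillian code), a client could instead chunk its submissions so each request stays under the threshold:

```go
// Keep each request below the MariaDB IN-list conversion threshold.
const maxLeavesPerRequest = 999

// chunkLeaves splits a large submission into threshold-sized batches.
func chunkLeaves(leaves []*trillian.LogLeaf) [][]*trillian.LogLeaf {
	var chunks [][]*trillian.LogLeaf
	for len(leaves) > maxLeavesPerRequest {
		chunks = append(chunks, leaves[:maxLeavesPerRequest])
		leaves = leaves[maxLeavesPerRequest:]
	}
	return append(chunks, leaves)
}
```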

@pphaneuf
Contributor

Thanks for the explanation, Pavel!
