
transaction logic improvement #12692

Closed · wants to merge 1 commit into master from profiling-txn

Conversation

@wilsonwang371 (Contributor)

server: change Txn() to use one transaction to avoid txn.End() extra overhead.

Based on our observations, the txn operation takes much more CPU time than a normal put operation. With some profiling, we found this is because of the txn.End() operation. To avoid the extra overhead, we should use one transaction inside Txn(). With this change, the txn-put benchmark shows a significant performance improvement.
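To illustrate the idea, here is a minimal self-contained sketch (hypothetical types, not the actual etcd applier code): the old path ends a read transaction and then opens a second write transaction to apply the ops, paying the End() cost twice, while the optimized path evaluates the compare and applies the op inside one write transaction.

package main

import (
	"fmt"
	"sync"
)

// backend is a toy stand-in for the storage backend; ends counts
// transaction End() calls, the overhead being discussed.
type backend struct {
	mu   sync.RWMutex
	data map[string]string
	ends int
}

func (b *backend) readTxn(f func(get func(string) string)) {
	b.mu.RLock()
	f(func(k string) string { return b.data[k] })
	b.mu.RUnlock()
	b.ends++ // End() of the read txn
}

func (b *backend) writeTxn(f func(get func(string) string, put func(string, string))) {
	b.mu.Lock()
	f(func(k string) string { return b.data[k] },
		func(k, v string) { b.data[k] = v })
	b.mu.Unlock()
	b.ends++ // End() of the write txn
}

// applyTxnOld models the old path: a read txn for the compare,
// then a separate write txn for the op -- two End() calls.
func applyTxnOld(b *backend, key, expect, newVal string) bool {
	ok := false
	b.readTxn(func(get func(string) string) { ok = get(key) == expect })
	if ok {
		b.writeTxn(func(_ func(string) string, put func(string, string)) { put(key, newVal) })
	}
	return ok
}

// applyTxnNew models the optimized path: one write txn evaluates
// the compare and applies the op -- a single End() call.
func applyTxnNew(b *backend, key, expect, newVal string) bool {
	ok := false
	b.writeTxn(func(get func(string) string, put func(string, string)) {
		if get(key) == expect {
			ok = true
			put(key, newVal)
		}
	})
	return ok
}

func main() {
	b := &backend{data: map[string]string{"k": ""}}
	applyTxnOld(b, "k", "", "v1")   // 2 End() calls
	applyTxnNew(b, "k", "v1", "v2") // 1 End() call
	fmt.Println(b.ends, b.data["k"])
}

Both paths produce the same final state; the optimized one simply halves the number of transaction boundaries on the write path.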

@codecov-io

Codecov Report

Merging #12692 (e177b82) into master (09f6b7f) will increase coverage by 1.08%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #12692      +/-   ##
==========================================
+ Coverage   63.18%   64.27%   +1.08%     
==========================================
  Files         449      409      -40     
  Lines       62349    32833   -29516     
==========================================
- Hits        39397    21104   -18293     
+ Misses      18381     9593    -8788     
+ Partials     4571     2136    -2435     
Flag | Coverage Δ
all  | 64.27% <ø> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
client/v2/auth_role.go 0.00% <ø> (ø)
client/v2/auth_user.go 0.00% <ø> (ø)
client/v2/cancelreq.go 100.00% <ø> (ø)
client/v2/client.go 43.54% <ø> (ø)
client/v2/cluster_error.go 50.00% <ø> (ø)
client/v2/curl.go 10.00% <ø> (ø)
client/v2/discover.go 0.00% <ø> (ø)
client/v2/json.go 32.00% <ø> (ø)
client/v2/keys.go 64.77% <ø> (ø)
client/v2/members.go 43.39% <ø> (ø)
... and 716 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a4570a6...e177b82.

@ptabor (Contributor)

ptabor commented Feb 17, 2021

Thank you Wilson, the numbers that you posted on etcd-dev@ look impressive.
It would be good to post them here as well.

I'm not familiar with this code, but from scanning the logic, I understand that
we change a pair of RO->RW transactions into just an RW transaction if the transaction potentially mutates any resource.
The tradeoff is that an RW transaction takes an exclusive lock, so it blocks all concurrent RO transactions.
So please read: #10523
I'm not saying it's not worthwhile, but we should also consider the performance impact on read-intensive scenarios.

I also wonder whether this impact overlaps with #12529.

@ptabor requested a review from jingyih — February 17, 2021 08:13
@wilsonwang371 (Contributor, Author)

The original post is here:
https://groups.google.com/g/etcd-dev/c/qiQ1V_iMPKo/m/7OfHsr68BAAJ

We are working on a k8s cluster scaling project. The apiserver uses etcd as its backend database, and the apiserver storage Create/Update/Delete operations are based on the Txn operation.

However, comparing TXN-PUT against PUT with the etcd benchmark tool, we see that the performance penalty for TXN-PUT is very significant.

Our machine's configuration (S09-I6VD1-US):

  • Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 2 sockets, 64 vCPUs
  • 256GB RAM
  • 2 x 2TB NVMe SSD
  • Mellanox CX-5 NIC

With the PUT benchmark, QPS can reach as much as 30K. With the TXN-PUT benchmark, however, performance peaks at only around 16K.

Here is a graph of the details. The X-axis is the number of connections and clients (they use the same value); the Y-axis is the value size of the KV pair (the key is always 256 bytes).

[chart: PUT vs TXN-PUT throughput across connections and value sizes]

I added a new bool flag to the etcd PUT operation, called "IsCreate". When it is TRUE, the put operation checks whether the key already exists; if it does, the applier fails the operation and returns an error.

With this change, I can see a significant performance improvement in Create operations. Here is another plot that shows the percentage improvement after switching a Create operation from TXN-PUT to PUT: we can get an improvement of as much as 160%.

[chart: percentage improvement after switching Create from TXN-PUT to PUT]

With some further investigation, we found the overhead is in txn.End() in the applier's Txn() function. With the latest pull request, the Txn() write operation speed also increases by as much as around 160%.

[chart: Txn() write performance with this pull request]

@wilsonwang371 (Contributor, Author)

> Thank you Wilson, the numbers that you posted on etcd-dev@ look impressive.
> It would be good to post them here as well.
>
> I'm not familiar with this code, but from scanning the logic, I understand that
> we change a pair of RO->RW transactions into just an RW transaction if the transaction potentially mutates any resource.
> The tradeoff is that an RW transaction takes an exclusive lock, so it blocks all concurrent RO transactions.
> So please read: #10523
> I'm not saying it's not worthwhile, but we should also consider the performance impact on read-intensive scenarios.
>
> I also wonder whether this impact overlaps with #12529.

Hi Piotr,

Very good point. I think #12529 could be the root cause of the extra overhead in Txn(), but I haven't evaluated that yet.

At the same time, this again makes me wonder whether adding flags or arguments to the gRPC calls for some very common db operations, such as CREATE/UPDATE/DELETE, could help users improve their application performance.

When we are using Txn(), we don't know in advance how many if-then-else branches there are. We cannot assume the user is using it in the way we expected.

Wilson

@wilsonwang371 (Contributor, Author)

@ptabor
Hi Piotr,

I am wondering whether it would be better to have another message, TxnRequestInternal, that holds some preprocessed values, so that depending on those values we can run an optimized path for Txn().

Wilson

@ptabor (Contributor)

ptabor commented Apr 6, 2021

What's the motivation to have another message?

  • readability
  • avoiding computing the same intrinsic multiple times

Looking at the applierV3 interface, there is no prior art of using a different type than the 'public' ones, which makes the bar for precedent relatively high.

If the reason is readability, you can implement additional methods in the api/etcdserverpb package directly, such that txn.isReadonly() has a natural look & feel.

@wilsonwang371 (Contributor, Author)

> What's the motivation to have another message?
>
>   • readability
>   • avoiding computing the same intrinsic multiple times
>
> Looking at the applierV3 interface, there is no prior art of using a different type than the 'public' ones, which makes the bar for precedent relatively high.
>
> If the reason is readability, you can implement additional methods in the api/etcdserverpb package directly, such that txn.isReadonly() has a natural look & feel.

The reason I am suggesting a new type of message is that I want to do Txn preprocessing, and in the current TxnRequest implementation there is no extra field in which to save the preprocessed information. If we process the Txn information only when it arrives at applierV3, it might be too late to do TxnRequest optimizations.

Thanks

Wilson

@ptabor (Contributor)

ptabor commented Apr 7, 2021

> If we process the Txn information when it arrives at applierV3, it might be too late to do TxnRequest optimizations.

Do you have an actual scenario where it's too late? Is there data missing to make the decision?

@wilsonwang371 (Contributor, Author)

> If we process the Txn information when it arrives at applierV3, it might be too late to do TxnRequest optimizations.
>
> Do you have an actual scenario where it's too late? Is there data missing to make the decision?

Good point. This assumption is based on my understanding that the requests are handled asynchronously while applierV3 applies Txn synchronously. I assume some preprocessing inside the Txn RPC request context can save more execution time than doing it inside applierV3.

What do you think?

Thanks

Wilson

@wilsonwang371 (Contributor, Author)

> If we process the Txn information when it arrives at applierV3, it might be too late to do TxnRequest optimizations.
>
> Do you have an actual scenario where it's too late? Is there data missing to make the decision?

Hi Piotr,

I made some changes to the code, and I am going to gather new performance results and share them with you later.

What I did was check the number of operations in the Txn: if there is only one compare and one real operation, we use a write txn directly instead of starting with a read txn and later switching to a write txn.

Wilson
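The heuristic described above could look roughly like this. This is a sketch with simplified, illustrative types — not etcd's actual pb.TxnRequest or applier code — and the function name isSimpleTxn is hypothetical:

package main

import "fmt"

// Simplified stand-ins for the protobuf request types.
type Compare struct{ Key string }
type RequestOp struct{ Put bool }

type TxnRequest struct {
	Compare []Compare
	Success []RequestOp
	Failure []RequestOp
}

// isSimpleTxn reports whether the txn is small enough that starting
// directly with a write transaction is cheaper than opening a read txn
// and later switching to a write txn.
func isSimpleTxn(r *TxnRequest) bool {
	return len(r.Compare) <= 1 && len(r.Success) <= 1 && len(r.Failure) <= 1
}

func main() {
	// One compare, one op in the success branch: qualifies for the fast path.
	simple := &TxnRequest{
		Compare: []Compare{{Key: "k"}},
		Success: []RequestOp{{Put: true}},
	}
	// Three ops in one branch: does not qualify.
	big := &TxnRequest{Success: make([]RequestOp, 3)}
	fmt.Println(isSimpleTxn(simple), isSimpleTxn(big))
}

A check like this can run before the request reaches the applier, which matches the earlier suggestion of precomputing as much as possible before the apply step.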

@wilsonwang371 force-pushed the profiling-txn branch 2 times, most recently from d7ff8d0 to 49ffd6d — April 13, 2021 04:16
@ptabor (Contributor)

ptabor commented Apr 13, 2021

Thank you. It's a good point that the more we check/precompute before RAFT, the higher the performance we can get during apply, which is critical. Ideally, internal requests should be different from the public ones, to be able to easily evolve the system without fear of backward-compatibility / cross-version-compatibility risks. But we are not (yet) there.

In practice: I think in-memory access to inspect proto content should take negligible time in contrast to backend writes.

@wilsonwang371 (Contributor, Author)

With the latest change, I added some logic so that if the transaction has at most 1 operation in each of success, failure & compare, we skip the read transaction and directly use a write transaction.

I evaluated the performance, and we still see a performance improvement.

I may later do some mixed read/write testing.

[charts: benchmark results with the single-write-transaction fast path]

@wilsonwang371 (Contributor, Author)

I did some further evaluation. I modified txn-put so that it can run mixed transaction reads and transaction writes with a predefined ratio (ratio = read/write).

I plotted the results with the ratio ranging from 0.125 (mostly write) to 8 (mostly read).

The red & blue plots show the difference between my patch and the master branch. We get most of the performance improvement in the mostly-write case; in the mostly-read cases we didn't get much improvement. (This is what we expected, and thankfully my patch did not affect the read transactions much.)

[charts: mixed read/write benchmarks, patch vs master, ratio 0.125–8]

@wilsonwang371 (Contributor, Author)

@ptabor Piotr,

Can you take a look at the latest evaluation and give some suggestions?

Thanks

Wilson

@ptabor (Contributor)

ptabor commented Apr 19, 2021

Thank you, Wilson.

Forgive me a question: what is the unit of the chart values?
Also, I'm curious: what tool was used to generate them, and why are the values represented as triangles (instead of squares)?

@wilsonwang371 (Contributor, Author)

> Thank you, Wilson.
>
> Forgive me a question: what is the unit of the chart values?
> Also, I'm curious: what tool was used to generate them, and why are the values represented as triangles (instead of squares)?

I was using Python's matplotlib tripcolor to plot the graphs; tripcolor draws a triangulated pseudocolor plot, which is why the cells are triangles rather than squares.

Here is my code.

import pandas as pd
import matplotlib.pyplot as plt

# Each CSV row: ratio, connections, value size, then 5 trial measurements.
df = pd.read_csv('mixed-master.csv')
df['avg'] = df.iloc[:, 3:].sum(axis=1) / 5.0   # average of the 5 trials
bigdf = df.iloc[:, [0, 1, 2, 8]]               # ratio, connections, value size, avg

plt.figure(figsize=(16, 20))
for count, (val, g) in enumerate(bigdf.groupby('ratio'), start=1):
    plt.subplot(3, 2, count)
    plt.tripcolor(g.iloc[:, 1], g.iloc[:, 2], g.iloc[:, 3])
    plt.title('Master Branch Mixed-Txn Ratio {} Performance'.format(val))
    plt.yscale('log', base=2)
    plt.ylabel('Value Size')
    plt.xscale('log', base=2)
    plt.xlabel('Connections Amount')
    plt.colorbar()
plt.show()


df2 = pd.read_csv('mixed-txn-optimized.csv')
df2['avg'] = df2.iloc[:, 3:].sum(axis=1) / 5.0
# .copy() so that adding the 'minus' column below does not
# trigger a SettingWithCopyWarning
bigdf2 = df2.iloc[:, [0, 1, 2, 8]].copy()

plt.figure(figsize=(16, 20))
for count, (val, g) in enumerate(bigdf2.groupby('ratio'), start=1):
    plt.subplot(3, 2, count)
    # plot the per-ratio group, not the whole frame
    plt.tripcolor(g.iloc[:, 1], g.iloc[:, 2], g.iloc[:, 3])
    plt.title('Optimized Mixed-Txn Ratio {} Performance'.format(val))
    plt.yscale('log', base=2)
    plt.ylabel('Value Size')
    plt.xscale('log', base=2)
    plt.xlabel('Connections Amount')
    plt.colorbar()
plt.show()


# Row-wise difference between the optimized and master averages.
bigdf2['minus'] = bigdf2.iloc[:, 3] - bigdf.iloc[:, 3]

plt.figure(figsize=(16, 20))
for count, (val, g) in enumerate(bigdf2.groupby('ratio'), start=1):
    plt.subplot(3, 2, count)
    plt.tripcolor(g.iloc[:, 1], g.iloc[:, 2], g.iloc[:, 4],
                  cmap=plt.get_cmap('seismic'))
    plt.title('(Optimized Txn - Master) Mixed-Txn Ratio {} Performance'.format(val))
    plt.yscale('log', base=2)
    plt.ylabel('Value Size')
    plt.xscale('log', base=2)
    plt.xlabel('Connections Amount')
    plt.colorbar()
plt.show()

And here are the result files.

mixed-txn-optimized.csv
mixed-master.csv

@@ -203,18 +204,21 @@ func isTxnSerializable(r *pb.TxnRequest) bool {
return true
}

func isTxnReadonly(r *pb.TxnRequest) bool {
func isTxnReadonlyAndSimple(r *pb.TxnRequest) (bool, bool) {
@ptabor (Contributor)

Thank you for the rename. Still, please add a comment explaining what 'simple' means.

@wilsonwang371 (Contributor, Author)

I changed this function back and moved the logic out.

@ptabor (Contributor)

ptabor commented Apr 19, 2021

Got it. Thank you

  1. Could you please plot the relative improvement, optimized / master?
     bigdf2['minus'] = bigdf2.iloc[:, 3] / bigdf.iloc[:, 3]

  2. Final check: what happens for a read/write ratio of 100 (so strongly read-dominated servers)?

@wilsonwang371 force-pushed the profiling-txn branch 2 times, most recently from b9a28b5 to 87189b8 — April 19, 2021 19:35
@wilsonwang371 (Contributor, Author)

> Got it. Thank you
>
> 1. Could you please plot the relative improvement, optimized / master?
>    bigdf2['minus'] = bigdf2.iloc[:, 3] / bigdf.iloc[:, 3]
> 2. Final check: what happens for a read/write ratio of 100 (so strongly read-dominated servers)?

This is the result of

bigdf2['minus'] = bigdf2.iloc[:, 3] / bigdf.iloc[:, 3]

[chart: relative improvement, optimized / master]

Regarding the 2nd question, I will post updates when new data is available.

@wilsonwang371 (Contributor, Author)

wilsonwang371 commented Apr 20, 2021

> Got it. Thank you
>
> 1. Could you please plot the relative improvement, optimized / master?
>    bigdf2['minus'] = bigdf2.iloc[:, 3] / bigdf.iloc[:, 3]
> 2. Final check: what happens for a read/write ratio of 100 (so strongly read-dominated servers)?

Regarding the second question: I did the evaluation and plotted

bigdf2.iloc[:, 3] / bigdf.iloc[:, 3]

Yes, we can see performance degradation in certain cases; the worst is around 90% of the master branch performance.

By the way, I forgot to mention that my mixed read/write test actually uses a range request that returns a large range response for its read operations. This is the case where we would expect the worst performance.

mixed-master-100.csv
mixed-txn-optimized-100.csv

[chart: read/write ratio = 100, optimized / master]

@ptabor (Contributor)

ptabor commented Apr 23, 2021

Thank you for checking. No clear win, unfortunately. But from what I've seen, K8s clusters have a Range/Txn ratio of ~4, so for a k8s workload it is positive.

That probably means we would need to gate the new behavior behind a flag.
If we were to build a recommendation for when it's worth disabling the flag, would it be a read/write ratio > 50?

@wilsonwang371 (Contributor, Author)

> Thank you for checking. No clear win, unfortunately. But from what I've seen, K8s clusters have a Range/Txn ratio of ~4, so for a k8s workload it is positive.
>
> That probably means we would need to gate the new behavior behind a flag.
> If we were to build a recommendation for when it's worth disabling the flag, would it be a read/write ratio > 50?

Hi Piotr,

What do you think about using ReadTx() instead of ConcurrentReadTx() in the applier's Txn()?

That way, we would not deep-copy the buffer, and at the same time we would not be taking a write lock while doing read operations in the applier's Txn().
