
Creating invoice + making payment is slower and slower #1506

Closed

NicolasDorier opened this issue May 15, 2018 · 55 comments

Comments

@NicolasDorier
Collaborator

I am benchmarking the simplest case where Alice pays Bob:

var invoice = await Bob.RPC.CreateInvoice(LightMoney.Satoshis(100));
await Alice.RPC.SendAsync(invoice.BOLT11);

It seems to take more and more time. I guess auto-cleaning invoices would mitigate the issue... but still, it should not grow linearly.

[Chart: benchmarks-runalicepaysbob-facettimeline]

This is 100 iterations, where each iteration is 16 runs of 3 concurrent payments.

@ZmnSCPxj
Collaborator

It could be the number of payments on Alice, not just the number of invoices on Bob.

@NicolasDorier
Collaborator Author

NicolasDorier commented May 15, 2018

OK, so to check this, I created 3 different Alices instead of 3 concurrent payments from the same Alice:

[Chart: benchmarks4-runalicespaybob-facettimeline]

There also seems to be some contention in the case of 3 simultaneous payments from the same Alice, costing around 100 ms.

There is indeed an improvement, but I am surprised it is not a 3x improvement. I would have expected the slope to be 3 times smaller.

@NicolasDorier
Collaborator Author

Slope:

Alice pays Bob: 2495
Alices pay Bob: 1120.75

@ZmnSCPxj
Collaborator

Possibly the remaining part is Bob getting overloaded too, hence the less-than-3x improvement. Invoice lookup should be O(log n), but payment lookup is probably much less efficient (I am not as certain; I am more familiar with the invoice side than the payment side, despite having practically written the current pay command implementation).

In particular, if you are doing multiple pay commands in parallel, some lookups are performed which are O(n) in the number of payments (payments being completed or failed in particular involve a linear lookup):

/* Caller responsible for freeing ctx. */
static void waitsendpay_resolve(const tal_t *ctx,
                                struct lightningd *ld,
                                const struct sha256 *payment_hash,
                                const struct sendpay_result *result)
{
        struct sendpay_command *pc;
        struct sendpay_command *next;

        list_for_each_safe(&ld->waitsendpay_commands, pc, next, list) {
                if (!structeq(payment_hash, &pc->payment_hash))
                        continue;

                /* Delete later (in our own caller) if callback did
                 * not delete. */
                tal_steal(ctx, pc);
                pc->cb(result, pc->cbarg);
        }
}

@NicolasDorier
Collaborator Author

Is there a way to purge this data?

@ZmnSCPxj
Collaborator

The linear list above is purged when a payment resolves: everything waiting for that payment to resolve is removed from the list, leaving only the payments that have not resolved yet. However, there is still an on-disk database table that contains payments, which is also looked up when a payment resolves (succeeds or fails). That table has an index on payment_hash, though, so the lookup should be O(log n).

We would do better with concrete profiling information, though --- would it be possible to provide that? gcc has options to generate profiling information. In particular, it would be a good way to determine whether the Alices or Bob are slower.

@NicolasDorier
Collaborator Author

What do you mean by "waiting for the payment to resolve"? I am not only creating invoices here, I am also paying them.

@NicolasDorier
Collaborator Author

NicolasDorier commented May 16, 2018

I am creating a project to easily benchmark some scenarios: https://github.com/dgarage/LightningBenchmarks (it will also support other implementations), though GCC profiling would give a better idea about the particular code causing real issues. I can include it in the list of generated artefacts.

@ZmnSCPxj
Collaborator

The pay command (which is the command that understands bolt11 invoices and executes the algorithm to ensure the payment completes) waits for an individual payment attempt to complete; if that attempt fails, it asks for another route and tries a new payment attempt. When the first HTLC in the routed payment is sent, it is "waiting for payment to resolve". Multiple such pay commands for various invoice payment preimages are put in a list that the code quoted above queries using an O(n) algorithm (basically it just iterates over the list until it finds the pay command that is executing that payment attempt).

Payments are resolved if they are known to have succeeded, or known to have failed or timed out.

Note that in C-lightning, a "payment" is an object on the payer that tracks an individual payment attempt (an attempt at getting the preimage for a payment hash in exchange for an amount), while an "invoice" is an object on the payee that has a preimage and its hash and which is resolved when it releases the preimage in exchange for receiving money. So when I say "waiting for payment to resolve" I mean an object on the payer that is waiting for the routed payment to return a success or failure.
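
For illustration, the pay loop described above looks conceptually like this (a sketch only: the identifiers are made up and the real code is event-driven rather than a blocking loop):

/* Illustrative sketch only: these types and helpers are made up and do
 * not correspond to c-lightning's real pay implementation. */
struct payment;                      /* payer-side state for one pay command */
struct route;                        /* a candidate path to the payee */

struct attempt_result {
        int success;                 /* got the preimage */
        int permanent_failure;       /* e.g. invoice unknown: do not retry */
};

extern struct route *find_route(struct payment *p);
extern struct attempt_result send_htlc_and_wait(struct payment *p,
                                                struct route *route);
extern void pay_succeeded(struct payment *p);
extern void pay_failed(struct payment *p, const char *why);

/* Conceptually what "pay" does for a single bolt11 invoice. */
void pay_until_resolved(struct payment *p)
{
        for (;;) {
                struct route *route = find_route(p);
                if (!route) {
                        pay_failed(p, "no route found");
                        return;
                }

                /* First HTLC sent: the payment is now "waiting to resolve". */
                struct attempt_result res = send_htlc_and_wait(p, route);
                if (res.success) {
                        pay_succeeded(p);
                        return;
                }
                if (res.permanent_failure) {
                        pay_failed(p, "permanent failure");
                        return;
                }
                /* Transient failure: ask for another route and retry. */
        }
}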

@NicolasDorier
Collaborator Author

Yes, but what I don't understand is that this list of awaiting payments should be purged when a payment gets successfully routed. They all are successfully routed, but the time it takes still increases.

@ZmnSCPxj
Collaborator

I probably need to audit the database tables then... maybe we forgot to index something we are using as a key in a table.

We are also using raw 32-byte blobs as keys in the database table, so maybe SQLite does not handle those well...
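
A quick standalone way to check from C whether such a lookup is indexed (a sketch; the table and column names here are assumptions, not necessarily our actual schema):

#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
        sqlite3 *db;
        sqlite3_stmt *stmt;

        if (sqlite3_open("lightningd.sqlite3", &db) != SQLITE_OK)
                return 1;

        /* SQLite indexes BLOB columns fine; without an index this
         * lookup is a full table scan (O(n)). */
        sqlite3_exec(db,
                     "CREATE INDEX IF NOT EXISTS payments_hash_idx"
                     " ON payments (payment_hash);",
                     NULL, NULL, NULL);

        /* EXPLAIN QUERY PLAN reports whether the index is used. */
        sqlite3_prepare_v2(db,
                           "EXPLAIN QUERY PLAN "
                           "SELECT * FROM payments WHERE payment_hash = ?;",
                           -1, &stmt, NULL);
        while (sqlite3_step(stmt) == SQLITE_ROW)
                printf("%s\n", (const char *)sqlite3_column_text(stmt, 3));

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
}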

@rockstardev

@NicolasDorier I've just done a similar rudimentary test with the lnd I have as part of that docker integration I'm doing for BtcPayServer... and payment time does seem to stay constant.

So, as @ZmnSCPxj says, this is likely a problem with indexing or something similar that increases search time as the number of transactions grows.

@NicolasDorier
Collaborator Author

@cdecker suggested I use perftrace to find out where the issue is. I will do that next week.

I have never used it, though, so if someone can give me pointers it would help a lot.

@ZmnSCPxj
Collaborator

@NicolasDorier quick question, does Bob perform any waitinvoice or waitanyinvoice?

@NicolasDorier
Collaborator Author

no

@NicolasDorier
Collaborator Author

NicolasDorier commented May 21, 2018

OK, so I managed to get more info in the stack traces. Sadly I don't think it helps.
The alice2 and bob2 SVGs were taken some time after the alice1 and bob1 SVGs. There is not really any change.

traces.zip (more up-to-date versions below)

If the issue is related to IO, would we see it in those graphs? I would assume we wouldn't, as the calls are asynchronous.

@ZmnSCPxj
Collaborator

Looks like a good part is logging. Hmm. That and fork.

@NicolasDorier
Collaborator Author

I am wondering if this is the full picture.

When you make a request to the DB, I don't think you block the thread, so it would not show on the flamegraph, but it would still make the call slow.

@ZmnSCPxj
Collaborator

We are using sqlite3, which, to my understanding, executes its code in the same process as the caller. All our db calls block rather than using a result callback.
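
For example, a tiny standalone illustration of that blocking pattern (not our actual db wrapper):

#include <sqlite3.h>
#include <assert.h>

int main(void)
{
        sqlite3 *db;
        sqlite3_stmt *stmt;

        /* SQLite runs entirely inside the calling process: no server,
         * no worker thread, no completion callback. */
        assert(sqlite3_open(":memory:", &db) == SQLITE_OK);
        assert(sqlite3_exec(db, "CREATE TABLE t (x INTEGER);",
                            NULL, NULL, NULL) == SQLITE_OK);

        sqlite3_prepare_v2(db, "INSERT INTO t VALUES (1);", -1, &stmt, NULL);
        /* sqlite3_step() does all the work (including any disk I/O for a
         * file-backed db) before it returns; the caller simply blocks. */
        sqlite3_step(stmt);

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
}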

@NicolasDorier
Collaborator Author

Trying to get more data points... running the script for longer and taking flamegraphs along the way... hopefully we can also see what is taking more and more time.

@NicolasDorier
Collaborator Author

I am also concerned about why I don't see any call trace for the pay or create-invoice calls that I make over RPC.

@NicolasDorier
Collaborator Author

I let it run for like 10-15 minutes.

At the beginning (Alice1 and Bob1), 7 concurrent payments took 700 ms; at the end (Alice3 and Bob3), they took 3 seconds.

By opening the traces in different browser tabs and switching via CTRL+TAB, you can see the evolution over time: Bob1, Bob2, Bob3, or the same for Alice.

I am disappointed. While log_status_msg takes more and more time, it really does not stand out that much, so I don't think it is the cause of the slowdown. (It should be improved, though.)

traces.zip

Also, why can't I see any calls to the RPC methods?
I am using socat to talk to Bob's and Alice's RPC from the host; this probably adds some overhead, so I will try to remove that part tomorrow. I doubt it is the cause though.

@cdecker
Member

cdecker commented May 21, 2018

The logging subsystem was never this taxing in any of my traces; that's surprising. Is the same true if you redirect to /dev/null? Are you using any special shell?

What bothers me is that the sqlite3_* calls are nowhere close to what I was expecting (they're really hard to find at all). But that's an issue I had in my own testing as well; does anybody have a good idea of how to make these stacks include them?

@NicolasDorier
Collaborator Author

NicolasDorier commented May 21, 2018

Yes, that surprises me as well. I will turn off logging and run again. (It was at debug level.)

@NicolasDorier
Collaborator Author

I am using docker-compose. The output is not shown at all on the screen actually.

@NicolasDorier
Collaborator Author

NicolasDorier commented May 21, 2018

OK, so using debug-level=broken considerably improved performance (2x faster). But there is still a linear increase.

My CPU and IO are mostly idle.

Logging is still taking a lot of time. I am running lightningd &. Tomorrow I will try redirecting to /dev/null to see if it improves anything...

I will also remove socat from the equation. It is probably not the problem, but it definitely adds overhead.

These are the latest traces... and they are pretty boring, almost identical to the previous ones, with a bit less time spent on logging.

traces.zip

Will continue tomorrow...

@NicolasDorier
Collaborator Author

NicolasDorier commented May 22, 2018

OK. So I can't really remove socat from the equation without massive headaches.

However, I refactored my code to limit the number of times I connect to a socket.
It improved performance.

docker stats shows I am near 100% CPU, while my Hyper-V VM hosting Docker is using about 10% CPU and my host CPU is only around 60%... long story short, I will run my tests on a separate hosted VM (DigitalOcean or Azure) so I can trust the performance numbers.

I still present my findings here:

The linear issue outlined above is still unsolved. I let Alice and Bob mind their own business for 30 minutes while taking a flamegraph every 5 minutes.

My findings are:

Logging functions are suspiciously taking way more time than they should, even though I run with debug-level=broken. Culprits:

  • status_fmt
  • log_status_msg
  • tal_vfmt
  • get_first

I could not pinpoint any big issues aside from #1518.
It seems things come from the ccan lib (ping @rustyrussell).

Also, a huge amount of time is spent in __clone. I have no idea what it is.

I have still not identified the linear time issue.

Here are the latest flamegraphs I ran.
traces.zip

@ZmnSCPxj
Collaborator

__clone is most likely the Linux clone syscall, or a wrapper thereof. It implements both the process-level fork and the thread-level pthread_create. We do not use threads explicitly, only fork. But the large amount of time spent in __clone is very strange --- we do not expect modern OSes to have a slow fork.
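
For reference, the fork pattern in question looks roughly like this (an illustrative sketch, not lightningd's actual subdaemon or bitcoin-cli spawning code):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

/* Illustrative only: spawning a child process the way a daemon might
 * fork out to a helper such as bitcoin-cli.  glibc implements both
 * fork() and pthread_create() on top of the clone syscall, which is
 * why this shows up as __clone in the traces. */
int main(void)
{
        pid_t pid = fork();

        if (pid == 0) {
                /* Child: replace ourselves with the helper binary. */
                execlp("bitcoin-cli", "bitcoin-cli", "getblockcount",
                       (char *)NULL);
                _exit(127); /* Only reached if exec failed. */
        } else if (pid > 0) {
                int status;
                /* Parent: wait for the child to finish. */
                waitpid(pid, &status, 0);
        } else {
                perror("fork");
                return 1;
        }
        return 0;
}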

@NicolasDorier
Collaborator Author

I am using Alpine 3.7 with some hacks for glibc compatibility.

@NicolasDorier
Collaborator Author

NicolasDorier commented May 22, 2018

Isn't fork supposed to create a child process? Because I see nothing like that being created when running ps.

@NicolasDorier
Collaborator Author

NicolasDorier commented May 22, 2018

Setting log-level to broken + #1518 + not opening one socket for every request improved performance for 3 concurrent calls dramatically (2500 ms => 900 ms after 16 * 100 payments).

However, there is still a consistent increase... I tried running those tests on a separate VM in case it was an issue with my PC, but no luck, same result.

[Chart: benchmarks-runalicepaysbob-facettimelinesmooth]

@ZmnSCPxj
Collaborator

We create a number of child processes: hsmd, gossipd, and a channeld for each channel. But the fact that __clone dominates is very strange, since we do not run that many child processes.

Perhaps some library is using threads or processes under the hood. But in any case it seems we will not get much useful information from the __clone part.

The fact that concurrency=1 still has a slight slope implies that some of the time is spent in a linear search over all payments or invoices.

@ZmnSCPxj
Collaborator

I suddenly remembered that we also fork out calls to bitcoin-cli. The call itself is relatively transient, but we make a large number of those while tracking the chain. Hmm. Presumably we can ignore the __clone if so.

@NicolasDorier
Collaborator Author

If so, why do you call bitcoin-cli so much when making a payment? Normally, creating an invoice or making a payment does not involve anything on-chain.

@NicolasDorier
Collaborator Author

NicolasDorier commented May 23, 2018

I am starting to wonder whether the reason I don't find anything relevant increasing in the flamegraph is that the slowdown is coming from inside the bitcoin-cli calls.

(I also tried running a flamegraph on an idle Alice: no __clone.)

@NicolasDorier
Collaborator Author

NicolasDorier commented May 23, 2018

@ZmnSCPxj I inserted a printf to see all bitcoin-cli commands. This is not the problem; no command is sent during the run.

@NicolasDorier
Collaborator Author

I throw in the towel. Let me know if I can try anything else to find out more info.
I will gather all the benchmarks and the docs needed to reproduce this in a git repo, along with the current results.

Feel free to close this issue unless you have ideas about where I can dig further.

@ZmnSCPxj
Collaborator

Okay. I will try to find some time to analyze your gathered data in detail. This makes me wonder if there are some threads going on in the background in some library we are using; I begin to suspect sqlite3. Hmm.

@NicolasDorier
Collaborator Author

NicolasDorier commented May 23, 2018

Is there a way to hook __clone? Maybe I can try something with gdb?

@NicolasDorier
Collaborator Author

NicolasDorier commented May 23, 2018

I tried to take a look with htop while it is running, hoping to find more. lightningd and channeld seem to leak.

I don't think I can see the threads correctly with htop on Alpine, though.

@NicolasDorier
Collaborator Author

I just don't understand... I added a breakpoint with gdb on __clone of lightningd_channel. At the same time I ran a perftrace... my breakpoint does not get hit, but __clone shows up on the trace.

@ZmnSCPxj
Collaborator

It is possible that the perftrace gets confused with regard to OS calls. OS calls might be using a completely separate stack, for example, so they appear to be rooted at the original __clone that started up lightningd itself. I suggest that we should pay more attention to OS calls like poll, read, and write than to __clone.

Interesting tidbit is timer_earliest, which is somewhat fatter on the Alice_traces/33min.svg than in the Alice_traces/9min.svg. The pay RPC command, on starting, sets up a timer that will stop the payment attempt. Timers apparently use some sort of hierarchy of lists that I do not understand right now (summon @rustyrussell for help). There is a single struct timers used to handle all timers inside lightningd --- perhaps the complicated hierarchy of lists is getting more difficult to traverse as more individual struct timer objects are registered onto it. Indeed one of the fat parts is brute_force_first, which even helpfully has this comment:

/* FIXME: Suboptimal */
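
As an illustration of why that can be linear, a brute-force earliest-timer scan looks roughly like this (a sketch only; ccan/timer's real implementation uses a hierarchy of buckets):

#include <stdint.h>
#include <stddef.h>

/* Sketch only: a flat list of pending timers.  A degenerate or fallback
 * search like this is O(n) in the number of registered timers. */
struct timer_sketch {
        struct timer_sketch *next;
        uint64_t expiry; /* absolute expiry time, e.g. in msec */
};

static struct timer_sketch *brute_force_earliest(struct timer_sketch *head)
{
        struct timer_sketch *best = NULL;

        /* Walk every pending timer: cost grows with the number of
         * in-flight pay commands that each registered a timeout. */
        for (struct timer_sketch *t = head; t; t = t->next) {
                if (!best || t->expiry < best->expiry)
                        best = t;
        }
        return best;
}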

Another thing fatter in Alice_traces/33min.svg than in Alice_traces/9min.svg is the flame starting from timer_expired, which is called when we want to actually commit the set of HTLCs we were building into an actual commitment transaction (we do this every 10 ms, so that multiple incoming/outgoing HTLCs have a chance of being put together in the same commitment). Fat parts on top of that include status_fmt, similar to the log_* functions. The problem is that log filtering is implemented in the main lightningd, and status_* functions always send to the main lightningd regardless of your log level. Indeed it seems to me, naively, that a good part of the time spent in status_vfmt is in towire_status_log and possibly the write system call (which we can see rooted from __clone).

So it seems to me that a good part of the delay on Alice is indeed logging due to the status_ system, with the log filtering done in lightningd rather than in the subdaemon (so that log-level=broken has no effect on the load).

Indeed, under __clone we see a good part of the time is from write on the lightning_channeld --- most likely from the status_* dumps --- and read on the main lightningd. It is very likely their communication channel since under sys_write and sys_read we see a good part go into sock_ related communications, then into something that starts with unix_, probably UNIX local sockets (which are what we use to communicate between subdaemons).

@NicolasDorier
Collaborator Author

So it seems we could gain a nice improvement by filtering logs at the child level instead of reporting everything back to lightningd. I don't think the logs are causing the linear issue, but it would still be a good improvement.

I noticed with htop that something is leaking as well. Maybe those timers are not properly cleaned up after expiration. This might explain the linear time issue. (Although I think there is something else; brute_force_first might increase, but it still does not seem to be the biggest perf issue.)

@NicolasDorier
Collaborator Author

NicolasDorier commented May 24, 2018

Left column: Alice; right column: Bob. Vertical axis: time (one trace every 5 minutes).
[Image: Traces]

Full resolution on https://aois.blob.core.windows.net/public/CLightningBenchmarkResults/traces.jpg

@NicolasDorier
Collaborator Author

Raw data on all the tests I did.

@Saibato
Contributor

Saibato commented May 27, 2018

dump_htlcs is called in channel.c when calculating the signature,
which leads to n(HTLCs) calls of status_trace,
so the additional logging grows with n(HTLCs), which consumes an amount of time that increases with n.
Can you try commenting out the call in channel.c and rerunning the tests?
Maybe dump_htlcs should be SUPERVERBOSE anyway.

static void dump_htlc(const struct htlc *htlc, const char *prefix)
{
        enum htlc_state remote_state;

        if (htlc->state <= RCVD_REMOVE_ACK_REVOCATION)
                remote_state = htlc->state + 10;
        else
                remote_state = htlc->state - 10;

        status_trace("%s: HTLC %s %"PRIu64" = %s/%s %s",
                     prefix,
                     htlc_owner(htlc) == LOCAL ? "LOCAL" : "REMOTE",
                     htlc->id,
                     htlc_state_name(htlc->state),
                     htlc_state_name(remote_state),
                     htlc->r ? "FULFILLED" : htlc->fail ? "FAILED" :
                     htlc->malformed ? "MALFORMED" : "");
}

void dump_htlcs(const struct channel *channel, const char *prefix)
{
        struct htlc_map_iter it;
        const struct htlc *htlc;

        for (htlc = htlc_map_first(channel->htlcs, &it);
             htlc;
             htlc = htlc_map_next(channel->htlcs, &it)) {
                dump_htlc(htlc, prefix);
        }
}
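
For illustration, the gating could look something like this (a sketch; the switch name is made up, not an existing c-lightning macro):

struct channel;
void dump_htlcs(const struct channel *channel, const char *prefix);

/* Sketch: compile the per-HTLC dump out entirely unless a "super
 * verbose" build switch is set (the name is illustrative). */
#ifdef SUPERVERBOSE_HTLC_DUMP
#define maybe_dump_htlcs(channel, prefix) dump_htlcs((channel), (prefix))
#else
#define maybe_dump_htlcs(channel, prefix) ((void)0)
#endif

/* The call site in channel.c would then use maybe_dump_htlcs(...) in
 * place of the unconditional dump_htlcs(...). */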

@ZmnSCPxj
Collaborator

Thank you @Saibato, that is probably indeed the cause of the linear increase of processing time proportional to the number of payments done in parallel.

Probably the best thing to do in response is to propagate the loglevel to the subdaemons somehow, and to suppress the call to dump_htlcs if the loglevel is too high. As a first cut, just send the loglevel to channeld in the channel_init message, and give the subdaemon-level status_* functions a new status_set_loglevel function which suppresses the lower-loglevel status_* calls (i.e. stops them from sending STATUS_* messages to lightningd). We would still have the filter at the lightningd level. Succeeding cuts could extend this loglevel to other subdaemons as we identify cases where logging greatly increases processing time.
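
A rough sketch of what that first cut might look like (the names follow the proposal above; none of this is existing code):

#include <stdarg.h>
#include <stdio.h>

/* Sketch of the proposed subdaemon-side filter.  The enum and function
 * names are illustrative; c-lightning's real log levels live elsewhere. */
enum sketch_log_level {
        SKETCH_LOG_DBG,
        SKETCH_LOG_INFO,
        SKETCH_LOG_UNUSUAL,
        SKETCH_LOG_BROKEN
};

static enum sketch_log_level subd_loglevel = SKETCH_LOG_DBG; /* default: send everything */

/* Called once, e.g. when handling a channel_init-style message that
 * carries the configured loglevel from lightningd. */
void status_set_loglevel(enum sketch_log_level level)
{
        subd_loglevel = level;
}

void status_fmt_filtered(enum sketch_log_level level, const char *fmt, ...)
{
        va_list ap;

        /* Cheap early-out: skip formatting and the write() to lightningd
         * entirely when the message would be filtered anyway. */
        if (level < subd_loglevel)
                return;

        va_start(ap, fmt);
        vprintf(fmt, ap); /* stand-in for "towire to lightningd" */
        va_end(ap);
}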

@ZmnSCPxj
Collaborator

Filtering at the subdaemon level is not a good idea, actually: lightningd still saves the messages, and they can be retrieved with the getlog command. Perhaps dump_htlcs can be made less verbose instead.

@NicolasDorier
Collaborator Author

Will try today, sorry for the delay!

@NicolasDorier
Collaborator Author

@Saibato it worked. I just commented out dump_htlcs.

@NicolasDorier
Collaborator Author

[Chart: benchmarks-runalicepaysbob-facettimelinesmooth]

@NicolasDorier
Collaborator Author

Closing this as the problem is identified.

Moving conversation to #1530

@NicolasDorier
Collaborator Author

NicolasDorier commented Jun 1, 2018

Just reporting the current landscape:

[Flamegraph: Alice] https://aois.blob.core.windows.net/public/temp/Alice.svg

[Flamegraph: Bob] https://aois.blob.core.windows.net/public/temp/Bob.svg

Seems better! I still can't get past 20 tx per second, but at least it is stable. It does not seem to be bottlenecked by IO. I will play with it more and see if there is some low-hanging fruit.
