
Drop ping sending rate-limit suggestion #918

Merged (1 commit, Dec 9, 2021)

Conversation

@TheBlueMatt (Collaborator)

There's not a lot of reason to explicitly rate-limit ping messages
on the sending side: hosts on the internet can always send you
as much traffic as they want, and it's up to you whether you want to
talk back to them.

This seems to have been intended as a cutoff below which nodes can
skip responding to pings, but in practice 30 seconds is much too
long a time to learn that your peer has disconnected.

We could reduce the threshold, but it's not like this is the only
place in the spec where a peer can request a message response, and
that is unlikely to change, making it of highly dubious value.

@m-schmoock (Collaborator) commented Sep 27, 2021

What about the receiver?

SHOULD fail the channels if it has received significantly in excess of one `ping` per 30 seconds.

Also, we could add a note that the sender must not send another ping while it has not yet received a pong,
because TCP buffer congestion could make it look like the sender is flooding pings.

@TheBlueMatt (Collaborator, Author)

What about the receiver?

Oops, sorry.

Also we could add a note the sender must not send another ping when it has not yet received a pong

I dunno if it's worth bothering? If there is a concern over outbound buffer fills or inbound traffic floods, rate-limiting pings does ~nothing to address that. Both must be addressed at a different level, as there will never not be a way to get someone to send a response message in the spec.

@m-schmoock (Collaborator)

Since you removed "SHOULD fail the channels if it has received significantly in excess of one ping per 30 seconds",
there is no longer any requirement that would need a note about suspending further pings until a pong is received. Before, it was 'technically inconsistent', as it told a node to FAIL CHANNELS when there may actually have just been TCP buffer congestion.

@TheBlueMatt (Collaborator, Author)

Ah, yes, I'd misunderstood the context for your comment, agreed.

@Crypt-iQ (Contributor)

What would be a reasonable amount of time to know that your peer has been disconnected?

@TheBlueMatt (Collaborator, Author)

We've switched to 5 seconds, which is probably too long in most cases but too short on Tor. There's no right answer, so the spec shouldn't be opinionated about it.

@Crypt-iQ (Contributor) commented Oct 5, 2021

A ping timeout of 5 seconds might cause rust-lightning to disconnect from a channel peer if they are under heavy load (e.g. a packet flood). Is the assumption you're making that this doesn't happen? On another note, I do agree with this spec pull that 30 seconds is arbitrary.

@TheBlueMatt (Collaborator, Author)

Five seconds is a huge amount of time (for non-Tor connections). Getting a message to a peer shouldn't take longer than a second, and getting the response back shouldn't take longer than a second. Taking three seconds to process a message probably indicates the peer is blocked on a disk write to a disk connected via smoke signals. I admit lots of nodes are basically smoke-signal-operated (i.e. running on RPis with SD cards), but the flip side has a real, nontrivial cost - accepting ping timeouts of more than ~500ms implies you may not detect a TCP socket getting killed when your app goes to the background. This appears to be especially nasty on iOS, where sockets hang when the app is in the background, and users want to open an app and hit "send" right away. You can ping right before the send, but of course now you've added an extra 250ms to your payment latency.

Obviously I'm being a bit facetious here, but the point is there isn't a really good answer - it depends a lot on the peer you're talking to (is it an LSP running on "real" hardware over clearnet, or an RPi that goes on vacation for seconds at a time and communicates over Tor?), your network (Tor, a 2G modem, or a datacenter?), etc. The spec should not have an opinion on ping latency, and it's up to users to tune their deployment to what makes sense.
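
To make the tuning knob concrete, here is a minimal sketch of the kind of per-peer ping-timeout tracking being discussed; the type and method names are hypothetical (not LDK's actual API), and the timeout value is exactly the deployment-specific choice argued about above:

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-peer liveness tracker: one outstanding ping at a time,
/// with a timeout the operator tunes for their deployment (Tor, mobile, LAN...).
struct PeerLiveness {
    ping_timeout: Duration,
    outstanding_ping: Option<Instant>, // when the current unanswered ping was sent
}

impl PeerLiveness {
    fn new(ping_timeout: Duration) -> Self {
        Self { ping_timeout, outstanding_ping: None }
    }

    /// Call when we send a ping; returns false if one is already in flight.
    fn on_ping_sent(&mut self, now: Instant) -> bool {
        if self.outstanding_ping.is_some() {
            return false;
        }
        self.outstanding_ping = Some(now);
        true
    }

    /// Call when the matching pong arrives.
    fn on_pong_received(&mut self) {
        self.outstanding_ping = None;
    }

    /// Call periodically; true means "assume the socket is dead, disconnect".
    fn should_disconnect(&self, now: Instant) -> bool {
        matches!(self.outstanding_ping, Some(sent) if now.duration_since(sent) > self.ping_timeout)
    }
}
```

A Tor or mobile deployment would construct this with a much larger `ping_timeout` than a datacenter peer; the argument above is that the spec has no business picking that number.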

@t-bast (Collaborator) left a comment

ACK

@m-schmoock (Collaborator)

ACK

The rationale for not adding a "no more than 1 in-flight ping" rule (yet, or at all) is that it would only be enforceable if the responder (the pong sender) set a random number that the sender has to include in its next ping for it to be valid.

Since adding this without need would be bikeshedding, we can just drop the rate limit and only worry about it if it ever actually becomes a problem.

@TheBlueMatt (Collaborator, Author)

cc @Roasbeef @Crypt-iQ there was some hesitation from the lnd side in the meeting - noting that y'all were thinking of maybe adding a rate-limit in the future. Obviously that would be incompatible with this PR as written, so if there's hesitation/a desire to do that in the future we should probably discuss it now rather than later. On the LDK end, we may actually start pinging even more often, using it to gate how quickly we send gossip messages to ensure we're always able to get channel messages through quickly even during full gossip sync. We'll still only have one in-flight ping at once, but may end up with pings happening at almost the RTT of a peer.

@Roasbeef (Collaborator) commented Nov 8, 2021

Obviously that would be incompatible with this PR as written, so if there's hesitation/a desire to do that in the future we should probably discuss it now rather than later.

We haven't made any strong commitments, but just that it's something we considered doing after a newer member of the team brought up concerns re not enforcing the deadline, and being flooded with ping messages. I think it'll be one of those reactive things, where if there's an issue, or people report their node is handling too many (?) pings, we'd start to throttle.

As I mentioned before, I don't have a strong opinion on this.

@TheBlueMatt (Collaborator, Author)

We haven't made any strong commitments, but just that it's something we considered doing after a newer member of the team brought up concerns re not enforcing the deadline, and being flooded with ping messages. I think it'll be one of those reactive things, where if there's an issue, or people report their node is handling too many (?) pings, we'd start to throttle.

As I mentioned before, I don't have a strong opinion on this.

These two things are mutually exclusive - "we've removed it from the spec and well-behaved implementations can do ~whatever" and "we may add a rate-limiter later, dunno, YOLO" are incompatible with each other. One thing we've been looking at on the LDK side is to rate-limit gossip sync by sending pings in between to ensure we keep consistent low latency. This would imply substantial ping volume, but not as a % of total messages.

@TheBlueMatt (Collaborator, Author)

Would you feel better if this PR was amended to say something like "no more than 50% of messages on the wire may be pings if more than one message is received per second"? That way you can still rate-limit to your heart's content later but it doesn't impact others.
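
As a rough illustration (not proposed spec text), the suggested wording could be checked with nothing more than two counters per peer; the struct and window length below are invented for this sketch:

```rust
/// Illustrative check of the proposed rule: over some observation window,
/// only object to pings if we saw more than one message per second AND
/// pings made up more than half of all messages received.
struct PingRatioWindow {
    window_secs: u64,
    total_msgs: u64,
    ping_msgs: u64,
}

impl PingRatioWindow {
    fn record(&mut self, is_ping: bool) {
        self.total_msgs += 1;
        if is_ping {
            self.ping_msgs += 1;
        }
    }

    fn peer_is_ping_flooding(&self) -> bool {
        let more_than_one_msg_per_sec = self.total_msgs > self.window_secs;
        let pings_are_majority = self.ping_msgs * 2 > self.total_msgs;
        more_than_one_msg_per_sec && pings_are_majority
    }
}
```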

@rustyrussell (Collaborator)

Ack. We don't enforce a limit either, and if you're being painful we will have to start anyway, whatever the spec says :)

@Roasbeef (Collaborator) commented Nov 8, 2021

If I say: "we don't plan to rate limit in the immediate future, but will examine it on an ongoing basis", are you satisfied as far as this PR? I think this is just a communication/language thing: I'm expressing the desire for optionality in the future, but I'm not committing to anything (as I don't know what the future holds; if someone starts spamming and destabilizing the network, we'll put in mitigations at the implementation level).

Would you feel better if this PR was amended to say something like "no more than 50% of messages on the wire may be pings if more than one message is received per second"?

Sure doesn't seem to hurt, also echoes @rustyrussell's comments earlier today about a heuristic that could be used to decide if peers are misbehaving.

I think this PR is fine to land, we don't do rate limiting atm, but only send pings every 15 seconds or so.

@Roasbeef (Collaborator) commented Nov 8, 2021

One thing we've been looking at on the LDK side is to rate-limit gossip sync by sending pings in between to ensure we keep consistent low latency.

Can you elaborate on this? Doesn't that assume certain implementation-level processing/prioritization details (particularly the internal queueing system used)?

@TheBlueMatt (Collaborator, Author)

If I say: "we don't plan to rate limit in the immediate future, but will examine it on an ongoing basis", are you satisfied as far as this PR?

I think you've been saying that for a while, and I don't see how that changes anything. I don't think anyone is suggesting we can't revisit this if someone starts ping-spamming (not that I buy that it would be useful, given they would just start udp-flood-spamming or spamming some other messages), but what you're saying, to me, is "well, we reserve the right to ignore the spec in this regard later, but you go ahead and do you", which isn't helpful here. There is no "well, we can merge this but maybe we'll ignore it later"; the spec exists so that we can agree on something and be compatible, not so that we can throw it out later because it's annoying. If you want to reserve the right to do something that the spec forbids in the future, the spec must not forbid it. I agree there is a communication issue here, but I'm entirely unsure why this is such an issue.

Sure doesn't seem to hurt, also echoes @rustyrussell's comments earlier today about a heuristic that could be used to decide if peers are misbehaving.

My question isn't "does it hurt", it's "what will lnd commit to not disconnecting for in the future", i.e. "what can we put in the spec to ensure compatibility with the various implementations", which is the whole point of the spec...

Can you elaborate on this? Doesn't that assume certain implementation-level processing/prioritization details (particularly the internal queueing system used)?

Not directly - we don't actually care if a peer prioritizes pings and responds to them immediately with gossip handled in the background (modulo TCP buffering delays and the like); all we care about is making sure our pings get through in a timely manner so that we can quickly detect disconnection.
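
For illustration only, a minimal sketch of that gossip-gating idea, assuming a hypothetical outbound queue rather than LDK's real data structures: channel messages always go out first, and the gossip backlog is only drained while the last ping has already been answered, so a slow or overloaded peer naturally slows our gossip without us assuming anything about its internal scheduling:

```rust
use std::collections::VecDeque;

/// Hypothetical outbound queue: channel messages are never delayed, while the
/// (large) gossip backlog is only drained when the last ping has already been
/// answered, i.e. while the peer is demonstrably keeping up.
struct OutboundQueue {
    awaiting_pong: bool,
    channel_msgs: VecDeque<Vec<u8>>,
    gossip_backlog: VecDeque<Vec<u8>>,
}

impl OutboundQueue {
    /// Pick the next message to write to the socket, if any.
    fn next_to_send(&mut self) -> Option<Vec<u8>> {
        if let Some(msg) = self.channel_msgs.pop_front() {
            return Some(msg); // HTLC/commitment traffic always goes first
        }
        if !self.awaiting_pong {
            return self.gossip_backlog.pop_front();
        }
        None // hold further gossip until the outstanding ping is answered
    }

    /// After flushing a chunk of gossip, interleave a ping so the next chunk
    /// is gated on the peer actually responding in a timely manner.
    fn should_send_ping(&self) -> bool {
        !self.awaiting_pong && !self.gossip_backlog.is_empty()
    }

    fn on_ping_sent(&mut self) { self.awaiting_pong = true; }
    fn on_pong_received(&mut self) { self.awaiting_pong = false; }
}
```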

@Crypt-iQ (Contributor) commented Nov 11, 2021

Just a question about your example: what if I don't have a UDP port open on my router? An attacker could fill up my bandwidth pipe, but it's a much higher threshold than if I had a TCP port open and the kernel kept getting interrupted for each received packet. Right?

@TheBlueMatt (Collaborator, Author)

Just a question about your example: what if I don't have a UDP port open on my router? An attacker could fill up my bandwidth pipe, but it's a much higher threshold than if I had a TCP port open and the kernel kept getting interrupted for each received packet. Right?

I doubt that? 1Gbps of UDP garbage is pretty easy for modern Linux to shrug off, but 1Gbps of UDP garbage is probably enough to make most home internet connections drop 20-50% of inbound packets, more than enough to make TCP unusable.

Of course, even if your ISP has some DDoS protection in place, you can always replace "UDP flood" with "port-9735 SYN flood" or even "flood of random odd-value messages" or "flood of repeating the same channel_update message over and over again" or.....

The "standard" method for these kinds of things is to track the "useful-message-vs-useless-message" ratio for any peer and get upset if its too high, but that's somewhat hard for lightning because you're expected to send update_channel messages a bunch which are duplicative with what other nodes sent, and you have to be really careful not to let any node flood you with a message that meets the "useful" criteria (like sending the same update_fee over and over again, or a different update_fee by 1 sat or...). Luckily in lightning we have a concept of "I like that peer" - "I have a channel with that peer", and you can firewall out other peers as needed, which seems like the only "real" solution to all of this.

@Crypt-iQ (Contributor) commented Nov 11, 2021

I doubt that? 1Gbps of UDP garbage is pretty easy for modern Linux to shrug off, but 1Gbps of UDP garbage is probably enough to make most home internet connections drop 20-50% of inbound packets, more than enough to make TCP unusable.

As an aside: I have a one-year-old Arch Linux install running on a five-ish-year-old MacBook. It cannot shrug off a packet flood, and I don't think I'm hitting 1Gbps, though it could be the age of the machine. I used perf on it, and it showed a livelock due to IRQs (receive interrupts). bitcoind dropped all of its connections as a result. And I could still use the internet on my work machine!

Luckily in lightning we have a concept of "I like that peer" - "I have a channel with that peer", and you can firewall out other peers as needed, which seems like the only "real" solution to all of this.

So I think this is the model that Laolu is referring to when saying rate-limiting pings might be a thing in lnd - not accepting connections except for certain peers we have channels with. Home node users / plebnet users without DDoS protection are not included here. So the unconnected-attacker's packet flood does not apply. If you exceed a threshold by sending too much junk, you probably should be disconnected (not banned, because of Tor). Though determining a threshold here is pretty tricky if you can send almost-useless commit_sig messages regardless of any other ping/pong threshold. If rust-lightning doesn't exceed the threshold, there would be no reason to close out connections. I don't think the honest settings of rust-lightning would be cause for a disconnect, but that's just my opinion.

@TheBlueMatt (Collaborator, Author) commented Nov 11, 2021

five-ish-year-old MacBook ... livelock

This sounds like a bug in your NIC driver, plus, yes, that's probably a pretty old machine for the purposes of this discussion.

So I think this is the model that Laolu is referring to when saying rate-limiting pings might be a thing in lnd - not accepting connections except for certain peers we have channels with.

How does this have anything to do with rate-limiting pings?

Home node users / plebnet users without DDoS protection are not included here. So the unconnected-attacker's packet flood does not apply.

Huh?

If you exceed a threshold by sending too much junk, you probably should be disconnected (not banned, because of Tor). Though determining a threshold here is pretty tricky if you can send almost-useless commit_sig messages regardless of any other ping/pong threshold. If rust-lightning doesn't exceed the threshold, there would be no reason to close out connections. I don't think the honest settings of rust-lightning would be cause for a disconnect, but that's just my opinion.

So we're back where we started? Let me rephrase the whole issue here - the PR text, as currently written, says that a peer can literally spin-loop pings because it wants to make the connection constant-bit-rate. This is in fact a totally reasonable thing to do for privacy! Set the connection to 1Mbps by sending pings in a loop!

If merged as-is, the spec says you MUST NOT disconnect just because a peer is doing that. Rate-limiting and not responding to these pings would be a spec violation; you MUST NOT do this. Y'all keep coming in and saying "well, we may do that in the future", which implies that you are NACK'ing this PR, because you're saying you want to (maybe) violate this proposed spec change in the future.

Obviously if you don't have a channel with a peer, you don't generally have to accept a connection with it, who cares, not your problem, but if you have a channel, this PR would require you to be able to handle ping "floods" as a condition of being a lightning node that cares about being able to stay connected to channel peers.

My point is that you are likely confused in wanting to do that explicitly, hence the conversation about the value of such rate-limiting, and I think this PR still makes sense. If you still disagree, I proposed alternate wording above.

@Crypt-iQ (Contributor) commented Nov 12, 2021

How does this have anything to do with rate-limiting pings?

You had previously brought up that there was no point to doing this since you could packet flood anyways. A random attacker won't be able to packet flood since we are working with a different model (cloud deployment with DDoS protection, whitelisting).

This is in fact a totally reasonable thing to do for privacy! Set the connection to 1Mbps by sending pings in a loop!

I had thought the motivation was for your mobile users and not privacy. If a node has 1000 channels, and each is spamming you with pings, you are now subject to a 1Gbps ping flood? Why would you not want to discourage this?

If merged as-is, the spec says you MUST NOT disconnect just because a peer is doing that

I don't see the MUST NOT clause? It seems pretty open-ended

@TheBlueMatt (Collaborator, Author)

You had previously brought up that there was no point to doing this since you could packet flood anyways. A random attacker won't be able to packet flood since we are working with a different model (cloud deployment with DDoS protection, whitelisting).

Rate-limiting pings still doesn't solve your issue, though: I can send you an update_fee message in a loop, or repeat gossip messages over and over again, or even sign new gossip messages in a loop so that they're "useful".

I had thought the motivation was for your mobile users and not privacy. If a node has 1000 channels, and each is spamming you with pings, you are now subject to a 1Gbps ping flood? Why would you not want to discourage this?

There are many reasons why allowing more pings could be useful. Indeed, in our case we had some mobile clients that had issues detecting disconnections, and we had to ship a faster ping time in order for things to work. Constant-bandwidth lightning connections are also probably something we need to move towards for privacy reasons, and pings aren't a terrible way to go about it, though dummy unknown odd messages may be even better. Maybe 1Mbps is too much, I dunno, but the point is that a raw rate-limit on pings isn't helpful in any context I can come up with.

I don't see the MUST NOT clause? It seems pretty open-ended

  • if `num_pong_bytes` is less than 65532:
    • MUST respond by sending a `pong` message, with `byteslen` equal to `num_pong_bytes`.
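
In code form, the quoted requirement amounts to roughly this (only the fields relevant here, per BOLT #1; the surrounding types are simplified). Note that there is no rate check anywhere, which is exactly what this PR's text relies on:

```rust
/// BOLT #1 ping, reduced to the field relevant to the quoted requirement.
struct Ping {
    num_pong_bytes: u16,
    // byteslen and the ignored payload are omitted here
}

/// BOLT #1 pong: `byteslen` zero bytes of ignored data.
struct Pong {
    byteslen: u16,
}

/// Answer every ping whose `num_pong_bytes` is below 65532 with a pong of
/// exactly that many ignored bytes; otherwise stay silent. No rate limit.
fn handle_ping(ping: &Ping) -> Option<Pong> {
    if ping.num_pong_bytes < 65532 {
        Some(Pong { byteslen: ping.num_pong_bytes })
    } else {
        None
    }
}
```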

@Crypt-iQ (Contributor)

Rate-limiting pings still doesn't solve your issue, though: I can send you an update_fee message in a loop, or repeat gossip messages over and over again, or even sign new gossip messages in a loop so that they're "useful".

We rate-limit gossip messages; rate-limiting pings if they're excessive isn't too much of a difference in my opinion. The same goes for rate-limiting other messages, depending on the amount of work it costs us to process them. This pull could use a feature bit, and then lnd just won't have to implement the feature bit, which seems like the best of both worlds?

Constant-bandwidth connections make sense in theory to me, but they seem not that great in certain use cases (high channel count). So I don't think it's enough of a reason for us not to rate-limit pings if they're excessive.

@TheBlueMatt (Collaborator, Author)

We rate-limit gossip messages; rate-limiting pings if they're excessive isn't too much of a difference in my opinion.

Do you also rate-limit channel messages?

This pull could use a feature bit, and then lnd just won't have to implement the feature bit, which seems like the best of both worlds?

Dear god, please let's not start shoving in feature bits for trivial things; that just makes everyone's code more complicated. If you have an issue with the PR, let's discuss how to make it better such that it fits within your idea of what lnd will accept.

Constant-bandwidth connections make sense in theory to me, but they seem not that great in certain use cases (high channel count). So I don't think it's enough of a reason for us not to rate-limit pings if they're excessive.

Sure, you'd have to do it at a reasonable rate, but let's say it's 64Kbps; that's probably more than enough for channel updates not to be delayed, but it also means you'll have a connection that's 99% ping messages, making "useful-messages"-based peer criteria hard to apply, though.

@Crypt-iQ (Contributor)

Do you also rate-limit channel messages?

No, but it's something I've idly thought about before.

If you have an issue with the PR, let's discuss how to make it better such that it fits within your idea of what lnd will accept.

Sure

I see in this pull that there's no mention of a timely response for pings. What happens if rust-lightning gets a response that is delayed - I recall you mentioning that you wanted to know if your peers were up or not? Additionally, are you padding your pings out and if so, by how much?

@TheBlueMatt (Collaborator, Author)

I see in this pull that there's no mention of a timely response for pings. What happens if rust-lightning gets a response that is delayed - I recall you mentioning that you wanted to know if your peers were up or not?

Disconnect + reconnect, because it probably means the socket is actually closed but the host didn't tell us, at least on iOS.

Additionally, are you padding your pings out and if so, by how much?

Currently I believe not, but it may be something useful in the future if/when we implement constant bandwidth mode.

@Crypt-iQ (Contributor)

Disconnect + reconnect, because it probably means the socket is actually closed but the host didn't tell us, at least on iOS.

But do you not end up requiring your peers to have a certain bandwidth or is that not an issue except for the potato-router model?

@TheBlueMatt (Collaborator, Author)

But do you not end up requiring your peers to have a certain bandwidth or is that not an issue except for the potato-router model?

Yea, I mean I think we require a full packet per second or so! If you don't get that, you probably can't be considered "Internet-connected" lol.

@Crypt-iQ (Contributor)

Yea, I mean I think we require a full packet per second or so! If you don't get that, you probably can't be considered "Internet-connected" lol.

Would the constant-bandwidth model require a certain bandwidth? And if so, how would it work if it requires 1Mbps while my speed is 500Kbps?

@TheBlueMatt (Collaborator, Author)

Would the constant-bandwidth model require a certain bandwidth? And if so, how would it work if it requires 1Mbps while my speed is 500Kbps?

Yes, it would disconnect. I suppose 1Mbps was a bit overly aggressive - a "real" CBR model here would probably be more like: (a) open a separate, non-CBR connection that you use only for gossip, (b) set the HTLC-only connection to CBR at, like, 32Kbps or 128Kbps or so.
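
As a sketch of what such a CBR (constant-bit-rate) link could look like, assuming a hypothetical pacer that runs once per tick: real traffic is counted, and whatever is left of the per-tick byte budget is filled with padding such as pings or unknown odd-typed messages. The rates and strategy here are illustrative, not specified anywhere:

```rust
/// Illustrative CBR pacer for an HTLC-only connection: every tick, whatever
/// budget real messages didn't use is burned on padding traffic (pings or
/// unknown odd-typed messages), so an observer sees a roughly constant rate.
struct CbrPacer {
    target_bytes_per_tick: usize,
    sent_this_tick: usize,
}

impl CbrPacer {
    fn on_real_message_sent(&mut self, len: usize) {
        self.sent_this_tick += len;
    }

    /// Called once per tick: returns how many padding bytes to send now and
    /// resets the counter for the next tick.
    fn padding_needed(&mut self) -> usize {
        let pad = self.target_bytes_per_tick.saturating_sub(self.sent_this_tick);
        self.sent_this_tick = 0;
        pad
    }
}
```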

@Roasbeef (Collaborator) left a comment

LGTM 🌛

@TheBlueMatt (Collaborator, Author)

@Crypt-iQ what is your current thinking on this one? Should we just land it and amend the spec here if y'all decide to go down a rate-limiting path later (presuming it's not something that'll happen fast/be needed any time soon)? Should we set a limit here that people are happy with? Or just rely on the general "well, if you don't have a channel with them, you don't have to stay connected in any case, really" take?

@Crypt-iQ (Contributor) commented Dec 7, 2021

I think this pull is fine and we may not need to go down the rate-limiting path at all.

@t-bast (Collaborator) left a comment

ACK 49e1c1c

@lightning-developer (Contributor) left a comment

ACK 49e1c1c

As @rustyrussell said, c-Lightning does not enforce this. They have a FIXME in their code to become spec-compliant at some point and include the 30-second limit: https://github.com/ElementsProject/lightning/blob/d22fd599973175e3d1a484a7aa2efd9916116bc5/common/ping.c#L16

Unless I overlooked something, Eclair currently also does not enforce the rate-limiting: https://github.com/ACINQ/eclair/blob/59b403559bdfc0ba884503bb12a14846b7aa5d2d/eclair-core/src/main/scala/fr/acinq/eclair/io/PeerConnection.scala#L190 They do, however, seem to send a heartbeat ping every 30 seconds if no other message was sent, and have the ping interval as a config parameter.

Electrum also does not seem to have implemented the 30-second rate-limiting for incoming ping messages: https://github.com/spesmilo/electrum/blob/56b03e2e8d6923c4d396c1272d3fafed57919a8f/electrum/lnpeer.py#L230

I did not check rust-lightning, as @TheBlueMatt proposed the PR.

I also did not check lnd, as @Roasbeef and @Crypt-iQ acked the PR after the lengthy discussion.

Given the current code of c-lightning, eclair, and electrum, we seem to have met the two-implementation threshold and should merge.

@t-bast (Collaborator) commented Dec 9, 2021

Unless I overlooked something, Eclair currently also does not enforce the rate-limiting: https://github.com/ACINQ/eclair/blob/59b403559bdfc0ba884503bb12a14846b7aa5d2d/eclair-core/src/main/scala/fr/acinq/eclair/io/PeerConnection.scala#L190 They do, however, seem to send a heartbeat ping every 30 seconds if no other message was sent, and have the ping interval as a config parameter.

Thanks for looking into it. That's correct, we don't enforce the rate-limiting, which is why we're fine with this PR: even old eclair nodes will be compliant with it.
