This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

network: Use "one shot" protocol handler. #3520

Merged: 4 commits merged into paritytech:master from twittner:new-behaviour on Feb 12, 2020

Conversation

@twittner (Contributor) commented Aug 30, 2019

Add two new `NetworkBehaviour`s, one handling remote block requests and another handling light client requests (both local and from remote). The change is motivated by the desire to use multiple substreams of a single connection for different protocols. To achieve this, libp2p's `OneShotHandler` is used as a protocol handler in each behaviour: it opens a fresh substream for the duration of the request and closes it afterwards. For block requests, we currently only handle incoming requests from remote, and tests are missing. For light client handling we support incoming requests from remote and have also ported a substantial amount of functionality over from `light_dispatch.rs` (including several tests). However, the result is lacking in at least two aspects:

  1. We require external updates w.r.t. the best block per peer and currently nothing updates this information.
  2. We carry a lot of peer-related state around.

Both aspects could be simplified by externalising peer selection and just requiring a specific peer ID to which the request should be sent. We still have to maintain some peer-related state due to the way libp2p's swarm and network behaviour work (e.g. we must make sure to always issue `NetworkBehaviourAction::SendEvent`s to peers we are connected to, otherwise the actions die a silent death).
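To illustrate the lifecycle, here is a toy sketch of the one-shot pattern. The types below merely stand in for libp2p's handler and muxer machinery; none of these names are the PR's actual API:

```rust
// A toy model of the "one shot" request flow: one substream per request.

struct Substream; // stands in for a negotiated, multiplexed substream

impl Substream {
    fn open(protocol: &str) -> Substream {
        // In libp2p, multistream-select would negotiate `protocol` here on a
        // fresh substream of the existing connection.
        println!("negotiating {}", protocol);
        Substream
    }

    fn send(&mut self, bytes: &[u8]) {
        println!("-> {} bytes", bytes.len());
    }

    fn receive(&mut self) -> Vec<u8> {
        b"response".to_vec()
    }
}

/// One request, one substream: open, exchange, then drop (which closes it).
fn one_shot_request(protocol: &str, request: &[u8]) -> Vec<u8> {
    let mut stream = Substream::open(protocol);
    stream.send(request);
    stream.receive()
    // `stream` is dropped here, closing the substream.
}

fn main() {
    let response = one_shot_request("/polkadot/light/1", b"remote_call_request");
    println!("<- {} bytes", response.len());
}
```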

Another change implemented here is the use of protocol buffers as the encoding for network messages. Certain individual fields of messages are still SCALE encoded. There has been some discussion about this in another PR (#3452), so far without resolution.

@gavofyork gavofyork added the A3-in_progress Pull request is in progress. No review needed at this stage. label Sep 1, 2019
@twittner twittner marked this pull request as ready for review September 2, 2019 16:29
@twittner twittner added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Sep 2, 2019
@twittner twittner requested a review from tomaka September 2, 2019 16:30
@arkpar (Member) commented Sep 3, 2019

Is there any round-trip involved for opening a substream?
Does it increase network traffic?

@twittner (Contributor, author) commented Sep 3, 2019

> Is there any round-trip involved for opening a substream?

Opening a new substream involves the use of multistream-select to negotiate the protocol to use with the stream. I think this currently involves at least one round-trip. Version 2 of multistream-select will eventually improve on this.
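Schematically, negotiating a fresh substream looks like this (each line is a length-prefixed, newline-terminated message on the wire; the protocol name is only an example):

```
dialer   -> listener : /multistream/1.0.0
listener -> dialer   : /multistream/1.0.0
dialer   -> listener : /polkadot/sync/1
listener -> dialer   : /polkadot/sync/1     (echo = accepted; "na" = unsupported)
dialer   -> listener : <request bytes>
```

The dialer may pipeline its two messages, but it still has to wait for the protocol echo before it can safely send the request, hence the round-trip.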

> Does it increase network traffic?

Depending on the multiplexer there might be an increase in control data. For example, with Yamux there are WindowUpdate frames to implement flow control, and those are sent per stream.
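For instance, the per-stream control framing with Yamux looks roughly like this (schematic; every Yamux frame carries a 12-byte header):

```
open substream  : first frame for the new stream carries the SYN flag
flow control    : receiver sends WindowUpdate frames per stream as its
                  receive window (256 KiB by default) drains
close substream : last frame for the stream carries the FIN flag
```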

@arkpar (Member) commented Sep 3, 2019

> The change is motivated by the desire to use multiple substreams of a single connection for different protocols.

So what motivates this desire, then? Extra round-trips for each request do not sound like a good idea to me.

@tomaka (Contributor) commented Sep 3, 2019

> So what motivates this desire, then? Extra round-trips for each request do not sound like a good idea to me.

The main motivation is de-coupling the gossiping from block requests, so that:

  • Light clients and nodes in general can make requests without occupying a gossiping slot (which is the case right now). In other words, you wouldn't occupy a gossiping slot anymore while major syncing for example.
  • Nodes can close their gossiping channel without interrupting all on-going requests.
  • We can open multiple gossiping channels (for example one only within validators, one for the relay chain, and one for each parachain) without having the problem of which channel to send the request to.

@tomaka (Contributor) commented Sep 3, 2019

Note that, theoretically, if/when multistream-select 2 happens, then that new system will be very slightly more efficient than what we currently have, as we would no longer transmit the type of message nor the request ID.

@arkpar (Member) commented Sep 3, 2019

I don't see how adding extra network latency can be more efficient.

What exactly is a "gossiping slot"?
The current protocol was designed in such a way that parallel requests are possible.

In general I'm OK with using a separate substream for gossiping, but not with opening a new substream for each request/response. One of our major goals is to minimize network latency and data propagation time.

@tomaka (Contributor) commented Sep 3, 2019

> What exactly is a "gossiping slot"?

The slots whose number is controlled with the `--in-peers` and `--out-peers` options, i.e. what the peerset manager decides.
Unfortunately, right now, if our bootnodes are full, nobody can connect to the network anymore. This is a protocol design issue that is in the process of being solved by de-coupling gossiping from other requests.

> I don't see how adding extra network latency can be more efficient.

Theoretically, there shouldn't be any additional round-trip compared to what we have right now.
Opening a substream is a very cheap operation. A substream is basically equivalent to a request ID.

The problem is that, due to the way the multistream-select protocol is designed, we have to wait for the remote to acknowledge that it supports the protocol we want (e.g. "requesting a block").

There is an optimization in place in the code of multistream-select that allows skipping this round-trip, but it unfortunately can't be applied automatically, because if the first bytes that the Substrate protocol transmitted were "/multistream-select/1.0\n" then things could break.
Since this isn't the case, we could add a few hacks to the code of multistream-select to instruct it to always remove the round-trip.
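Schematically, the optimistic ("lazy") variant pipelines everything and assumes the listener will accept; this is only safe when the dialer supports exactly one protocol for that substream:

```
dialer   -> listener : /multistream/1.0.0
dialer   -> listener : /polkadot/sync/1
dialer   -> listener : <request bytes>      (sent without waiting for the echo)
listener -> dialer   : /multistream/1.0.0
listener -> dialer   : /polkadot/sync/1
listener -> dialer   : <response bytes>
```

If the listener does not support the proposed protocol, the pipelined request is wasted and the substream fails, which is why this cannot be enabled blindly.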

@tomaka (Contributor) left a comment

Looks mostly good to me.

```diff
@@ -109,6 +117,11 @@ impl<B: BlockT, S: NetworkSpecialization<B>, H: ExHashT> Behaviour<B, S, H> {
 	pub fn put_value(&mut self, key: record::Key, value: Vec<u8>) {
 		self.discovery.put_value(key, value);
 	}
+
+	/// Get unique access to the light client handler.
+	pub fn light_client_handler(&mut self) -> &mut protocol::LightClientHandler<Substream<StreamMuxerBox>, B> {
```
Contributor:

I don't think this is used anywhere, so I'd prefer to remove it, otherwise someone is going to use this method. The LightClientHandler is supposed to be an implementation detail, so not accessible directly.

Contributor Author (@twittner):

How do you want to expose the request functionality of the light client handler? Do you want to reproduce its request method here and forward the call?

Contributor:

> Do you want to reproduce its request method here and forward the call?

Yes!
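A toy sketch of the suggested shape; all names here are hypothetical and the handler's real `request` signature in this PR may differ:

```rust
// Keep the handler private and re-expose only its request entry point.

struct LightClientHandler;

impl LightClientHandler {
    fn request(&mut self, req: &str) -> Result<(), String> {
        println!("dispatching {}", req);
        Ok(())
    }
}

struct Behaviour {
    light_client_handler: LightClientHandler, // implementation detail, not `pub`
}

impl Behaviour {
    /// Forwarding wrapper instead of handing out the handler itself.
    fn light_client_request(&mut self, req: &str) -> Result<(), String> {
        self.light_client_handler.request(req)
    }
}

fn main() {
    let mut behaviour = Behaviour { light_client_handler: LightClientHandler };
    behaviour.light_client_request("remote_read_request").unwrap();
}
```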

```rust
max_block_data_response: 128,
max_request_len: 1024 * 1024,
inactivity_timeout: Duration::from_secs(15),
protocol: b"/polkadot/sync/1",
```
Contributor:

I don't think there should be a default value for that.
The value should be `format!("/{}/sync/1", protocol_id)` (where `protocol_id` is the protocol id of type `ProtocolId` that is passed around during the network's initialization), and it must be passed as a parameter.
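A minimal sketch of that construction, assuming the protocol id can be rendered as a string (the real `ProtocolId` type's API may differ):

```rust
// Hypothetical helper: derive the protocol name from the chain's protocol id
// instead of hard-coding "/polkadot/...".
fn sync_protocol_name(protocol_id: &str) -> Vec<u8> {
    format!("/{}/sync/1", protocol_id).into_bytes()
}

fn main() {
    assert_eq!(sync_protocol_name("polkadot"), b"/polkadot/sync/1".to_vec());
}
```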

```rust
max_pending_requests: 128,
inactivity_timeout: Duration::from_secs(15),
request_timeout: Duration::from_secs(15),
protocol: b"/polkadot/light/1",
```
Contributor:

Same remark.

```rust
// random ID, we should not have an entry for it with this peer ID in
// our `outstanding` map, so a malicious peer should not be able to get
// us here. It is our own fault and must be fixed!
panic!("unexpected peer status {:?} for {}", info.status, peer);
```
Contributor:

I'd go for calling log::error! and returning instead.
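A toy sketch of that pattern, assuming the `log` crate; `Status` and `handle_response` are stand-ins rather than the PR's actual types:

```rust
// Log and drop the message instead of panicking on inconsistent state.

#[derive(Debug)]
enum Status {
    Idle,
    AwaitingResponse,
}

fn handle_response(status: Status, peer: &str) {
    match status {
        Status::AwaitingResponse => println!("processing response from {}", peer),
        other => {
            // Previously a `panic!`; recover by ignoring the message.
            log::error!("unexpected peer status {:?} for {}", other, peer);
        }
    }
}

fn main() {
    handle_response(Status::AwaitingResponse, "peer-1");
    handle_response(Status::Idle, "peer-2"); // logged, not fatal
}
```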

Contributor Author (@twittner):

I do not think it makes sense to continue past that point, as the internal state is inconsistent. As mentioned in the comment, this should never be reached. If it is, however, the program as written is wrong and needs to be fixed.

It would be nice if the `NetworkBehaviour` trait provided a way to signal an error condition to the swarm, which would then stop polling it.

Contributor:

@twittner our codebase policy with panickers is to prove that they won't happen or remove them. Is it not possible to recover to a valid protocol state again? A mild programmer error should not take down a decentralized network.

Contributor Author (@twittner), Sep 24, 2019:

If you prefer to keep the behaviour going no matter what, we could maybe attempt to reset all internal state.

Edit 1: However, since the swarm that manages this behaviour would be unaware of such a reset, we would not know about our peers anymore. So we would have to keep parts of the state.

Edit 2: Also, by removing request information we may end up punishing peers when we receive responses to requests we issued in the past but have since forgotten about.

Edit 3: I think that continuing under all circumstances is not advisable as it further complicates the set of possible states. As I mentioned in a comment above it would be nice if we could signal an error back to the swarm which could then decide what to do with this NetworkBehaviour, e.g. to reinitialise it.

Contributor:

Is there a way to kill the node gracefully then?

Contributor Author (@twittner), Sep 24, 2019:

Not that I am aware of. The `NetworkBehaviour` trait offers no such mechanism at the moment, AFAIK. This would need to be addressed on the libp2p side first. CC @tomaka.

Contributor Author (@twittner):

I have opened libp2p/rust-libp2p#1256 to discuss this further.

@tomaka (Contributor), Sep 24, 2019:

It doesn't make sense to have in libp2p a generic mechanism to say "hey, there's a bug in the code, please do something".
The `NetworkBehaviour` can, however, emit an event that can correspond to anything we want.
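For example, with a hypothetical event type (not from this PR):

```rust
// Surface the inconsistency as an event for whoever polls the behaviour,
// e.g. so the node can reinitialise it or shut down gracefully.
#[derive(Debug)]
#[allow(dead_code)]
enum Event {
    Response { peer: String, data: Vec<u8> },
    /// Internal state became inconsistent; the owner may reinitialise us.
    InconsistentState { peer: String },
}

fn main() {
    let event = Event::InconsistentState { peer: "peer-1".to_string() };
    println!("emitted {:?}", event);
}
```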

```rust
// expected. So, if we can not find an entry for it in our peer information table,
// then these two collections are out of sync which must not happen and is a clear
// programmer error that must be fixed!
panic!("missing peer information for {}; response {}", peer, id);
```
Contributor:

Same: error!

tomaka previously approved these changes Sep 9, 2019

@tomaka (Contributor) left a comment

Looks good to me

@arkpar (Member) commented Sep 12, 2019

I'm still not convinced that opening a substream for each request is a good idea. Until libp2p is able to support this without extra round-trips and additional traffic, this is a regression, and there's no good reason to introduce it.

As for slot management, there should still be a configurable limit on the number of simultaneous connections, to prevent DoS/OOM attacks.

> Unfortunately, right now, if our bootnodes are full, nobody can connect to the network anymore. This is a protocol design issue that is in the process of being solved by de-coupling gossiping from other requests.

This should be fixed by allowing additional slots for short-lived discovery connections, not by allowing unlimited connections.

@tomaka (Contributor) commented Sep 12, 2019

Note that this PR adds support for this new clean protocol in addition to the dumb one we have right now. The default is still the dumb one.

> This should be fixed by allowing additional slots for short-lived discovery connections, not by allowing unlimited connections.

The only way to do that is to open a second TCP connection, which is against the design of libp2p.

What the new protocol that this PR introduces does is make Substrate actually fit the design of libp2p, whereas what you're suggesting moves further away from how libp2p is designed.

@gavofyork (Member) commented:

> What the new protocol that this PR introduces does is make Substrate actually fit the design of libp2p

> Until libp2p is able to support this without extra round-trips and additional traffic

What are the chances of this ever actually happening? While it might be nice to tick the box of "designed to the ideals of the Gods of libp2p", performance should not go out of the window in pursuit of that.

@tomaka (Contributor) commented Sep 12, 2019

> What are the chances of this ever actually happening?

There are two ways to solve that:

  • Wait for the multistream-select v2 protocol to be specc'ed and implement it. We've been waiting for around 2 years now.
  • Say "fuck it" and implement our equivalent. We would no longer be compatible with js-libp2p and go-libp2p.

As time passes, I'm leaning more and more towards option 2.

@romanb (Contributor) commented Sep 12, 2019

> What are the chances of this ever actually happening?

> • Say "fuck it" and implement our equivalent. We would no longer be compatible with js-libp2p and go-libp2p.

> As time passes, I'm leaning more and more towards option 2.

I don't think it's as bad as it sounds, though. rust-libp2p could support both: a multistream-select v1 that supports 0-RTT negotiation (which I think would be identical on the wire, just differing in if and when negotiation frames are flushed), and the standard multistream-select v1 with the extra round-trips. The choice would have to be static, of course, since 0-RTT is only really possible when the "dialer" only supports a single protocol at each level of the protocol stack (i.e. for each upgrade, including the initial multistream protocol).

So an application using rust-libp2p would have to choose either the "improved" protocol, and only be compatible with similarly configured rust-libp2p peers, or the standard protocol for maximum compatibility at the cost of extra round-trips. This choice doesn't even need to be at the scope of the entirety of libp2p, however. It can be scoped to connections and substreams, so that standard libp2p protocols like libp2p-kad can remain compatible.

I'm going to submit a small proposal for rust-libp2p shortly to demonstrate what I mean, which may settle the issue with the extra round-trips here by configuring the SubstreamProtocol in the two new network behaviours in this PR with a "0-RTT" version.

Also, since 0-RTT negotiation is by its very nature incompatible with supporting multiple alternative protocols at any level, even if multistream-select 2.0 were already done and ready, if you really wanted 0-RTT negotiation on substreams you would have to not support 1.0 at the same time, i.e. commit to 2.0 in the same manner as you would commit to a custom protocol, thereby dropping interoperability with all libp2p peers not speaking 2.0 (yet). You would only get both 0-RTT and full interoperability once 2.0 is supported and rolled out across all libp2p clients (and only until 3.0 ;-)), which is surely even further away than having a final 2.0 spec and implementations.

Is it fair to assume that it is not important for substrate to be compatible with peers potentially based on other libp2p implementations, at least for the foreseeable future (i.e. all peers of a substrate node are assumed to be, well, substrate nodes)?

@gavofyork (Member) commented Sep 13, 2019

sadly, i think we will need to retain a compatibility mode since there are polkadot implementations being created in at least three other languages (JS, Go, C++) that we will need to talk to.

i'm good with the idea of having two alternatives for handling multistreams, where we use the efficient one if both ends support it and the "standard" one otherwise.

@romanb (Contributor) commented Sep 13, 2019

> sadly, i think we will need to retain a compatibility mode since there are polkadot implementations being created in at least three other languages (JS, Go, C++) that we will need to talk to.

I see. It may still be an option, though, to convince all these parties to adopt some additional assumptions around the use of multistream-select v1 that are necessary to make its use safe with "lazy", and thus 0-RTT, negotiation on substreams, because I don't think this strictly requires any changes to the protocol itself (on the wire), just some additional assumptions on how it is implemented and knowledge of the (fixed set of) protocols being used. See libp2p/rust-libp2p#1212 and libp2p/rust-libp2p#1245 for further detail.

@gavofyork gavofyork added A3-in_progress Pull request is in progress. No review needed at this stage. and removed A0-please_review Pull request needs code review. labels Sep 21, 2019
@rphmeier (Contributor) commented:

> The choice would have to be static, of course, since 0-RTT is only really possible when the "dialer" only supports a single protocol at each level of the protocol stack (i.e. for each upgrade, including the initial multistream protocol)

This I don't quite understand. Is there no way to do a few round-trips in the beginning of a connection to negotiate on what kinds of substreams you could do 0-RTT negotiation for?

@tomaka (Contributor) commented Sep 24, 2019

> This I don't quite understand. Is there no way to do a few round-trips in the beginning of a connection to negotiate on what kinds of substreams you could do 0-RTT negotiation for?

Not "automatically" (as in, through a built-in mechanism in the libp2p protocol).

We could do it specifically for Substrate/Polkadot though.

@tomaka (Contributor) commented Oct 25, 2019

@twittner Could you please update to the master branch? We might merge this soon.

@bkchr (Member) commented Jan 28, 2020

Any update on this?

@twittner (Contributor, author) commented:

I do not know if there is agreement in principle that the approach taken here should be merged. Other than that, there are quite a few conflicts that need to be resolved once more.

@tomaka (Contributor) commented Jan 29, 2020

I'm still in favour of merging this. Again, this code path isn't used in production yet and we will upgrade over time to see if it works first.
If our network protocol is to eventually make sense, we need either to merge this PR or to rewrite a similar PR later.

@twittner twittner requested a review from tomaka February 10, 2020 15:59
@twittner (Contributor, author) commented:

Updated and rebased. Ready for review again.

@twittner twittner added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Feb 10, 2020
@twittner twittner dismissed tomaka’s stale review February 10, 2020 16:01

Too much time has passed since this approval.

@tomaka (Contributor) left a comment

The design of the code is the same as last time, and it looks good to me.
I didn't review in fine detail to spot every possible typo, but the test coverage gives confidence that this works well.

@tomaka (Contributor) commented Feb 12, 2020

For the sake of not leaving this PR open for a couple more months, I'm going to merge it.
As discussed above, these code paths aren't used by default yet.

@tomaka tomaka merged commit ea721a1 into paritytech:master Feb 12, 2020
@twittner twittner deleted the new-behaviour branch February 12, 2020 10:56