Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

p2p: new dial scheduler #20592

Merged
merged 18 commits into from
Feb 13, 2020
Merged

p2p: new dial scheduler #20592

merged 18 commits into from
Feb 13, 2020

Conversation

fjl
Copy link
Contributor

@fjl fjl commented Jan 24, 2020

This change replaces the peer-to-peer dial scheduler with a new and
improved implementation. The new code is better than the previous
implementation in two key aspects:

  • The time between discovery of a node and dialing that node is
    significantly lower in the new version. The old dialState kept a
    buffer of nodes and launched a task to refill it whenever the buffer
    became empty. This worked well with the discovery interface we used
    to have, but doesn't really work with the new iterator-based
    discovery API.

  • Selection of static dial candidates (created by Server.AddPeer or
    through static-nodes.json) performs much better for large amounts of
    static peers. Connections to static nodes are now limited like
    dynamic dials and can no longer overstep MaxPeers or the dial ratio.

There are a few further tweaks to peer handling in this PR:

  • The fallback logic for connecting to the bootnodes when discovery
    doesn't work is removed. This change will impede connectivity for
    people without UDP access. Removing this feature is OK for public
    testnets and the mainnet because DNS node list can take its place.
    For private networks, users without working discovery must switch to
    static peering (good choice for small networks) or set up custom DNS
    lists.

  • --maxpeers=1 works again and now allocates a dial slot instead of
    a listener slot.

  • RemovePeer is now synchronous. This should generally be backwards
    compatible, but can lead to a deadlock if called from within a
    protocol. Protocol implementations should use Disconnect instead of
    RemovePeer.

  • Connection logging is improved and the output of geth --vmodule=p2p=5
    is a lot more useful now. We previously logged outgoing connection
    failures up to three times, using different log fields and messages.
    The changes in this PR ensure a single log message is emitted per
    failure and all messages use the id=... addr=... conn=... format
    (with id= omitted while the node ID isn't known yet).

This change replaces the peer-to-peer dial scheduler with a new and
improved implementation. The new code is better than the previous
implementation in two key aspects:

- The time between discovery of a node and dialing that node is
  significantly lower in the new version. The old dialState kept
  a buffer of nodes and launched a task to refill it whenever the buffer
  became empty. This worked well with the discovery interface we used to
  have, but doesn't really work with the new iterator-based discovery
  API.

- Selection of static dial candidates (created by Server.AddPeer or
  through static-nodes.json) performs much better for large amounts of
  static peers. Connections to static nodes are now limited like dynanic
  dials and can no longer overstep MaxPeers or the dial ratio.
@fjl
Copy link
Contributor Author

fjl commented Jan 27, 2020

Already found a minor issue with this: RemovePeer no longer disconnects the peer.

Always allocate at least one dial slot unless dialing is disabled using
NoDial or MaxPeers == 0. Most importantly, this fixes MaxPeers == 1 to
dedicate the sole slot to dialing instead of listening.
Also make RemovePeer synchronous and add a test.
We previously logged outgoing connection failures up to three times.

- in SetupConn() as "Setting up connection failed addr=..."
- in setupConn() with an error-specific message and "id=... addr=..."
- in dial() as "Dial error task=..."

This commit ensures a single log message is emitted per failure and adds
"id=... addr=... conn=..." everywhere (id= omitted when the ID isn't
known yet).

Also avoid printing a log message when a static dial fails but can't be
resolved because discv4 is disabled. The light client hit this case all
the time, increasing the message count to four lines per failed
connection.
@fjl
Copy link
Contributor Author

fjl commented Feb 9, 2020

@karalabe @holiman please re-review

@holiman
Copy link
Contributor

holiman commented Feb 10, 2020

I can't really meaningfully review this ... Maybe let's just merge it to master after next release and see how it performs?

enriquefynn pushed a commit to enriquefynn/go-ethereum that referenced this pull request Mar 10, 2021
* p2p: new dial scheduler

This change replaces the peer-to-peer dial scheduler with a new and
improved implementation. The new code is better than the previous
implementation in two key aspects:

- The time between discovery of a node and dialing that node is
  significantly lower in the new version. The old dialState kept
  a buffer of nodes and launched a task to refill it whenever the buffer
  became empty. This worked well with the discovery interface we used to
  have, but doesn't really work with the new iterator-based discovery
  API.

- Selection of static dial candidates (created by Server.AddPeer or
  through static-nodes.json) performs much better for large amounts of
  static peers. Connections to static nodes are now limited like dynanic
  dials and can no longer overstep MaxPeers or the dial ratio.

* p2p/simulations/adapters: adapt to new NodeDialer interface

* p2p: re-add check for self in checkDial

* p2p: remove peersetCh

* p2p: allow static dials when discovery is disabled

* p2p: add test for dialScheduler.removeStatic

* p2p: remove blank line

* p2p: fix documentation of maxDialPeers

* p2p: change "ok" to "added" in static node log

* p2p: improve dialTask docs

Also increase log level for "Can't resolve node"

* p2p: ensure dial resolver is truly nil without discovery

* p2p: add "looking for peers" log message

* p2p: clean up Server.run comments

* p2p: fix maxDialedConns for maxpeers < dialRatio

Always allocate at least one dial slot unless dialing is disabled using
NoDial or MaxPeers == 0. Most importantly, this fixes MaxPeers == 1 to
dedicate the sole slot to dialing instead of listening.

* p2p: fix RemovePeer to disconnect the peer again

Also make RemovePeer synchronous and add a test.

* p2p: remove "Connection set up" log message

* p2p: clean up connection logging

We previously logged outgoing connection failures up to three times.

- in SetupConn() as "Setting up connection failed addr=..."
- in setupConn() with an error-specific message and "id=... addr=..."
- in dial() as "Dial error task=..."

This commit ensures a single log message is emitted per failure and adds
"id=... addr=... conn=..." everywhere (id= omitted when the ID isn't
known yet).

Also avoid printing a log message when a static dial fails but can't be
resolved because discv4 is disabled. The light client hit this case all
the time, increasing the message count to four lines per failed
connection.

* p2p: document that RemovePeer blocks
tony-ricciardi pushed a commit to tony-ricciardi/go-ethereum that referenced this pull request Jan 20, 2022
* celo-blockchain cherry-pick from upstream: p2p: new dial scheduler (ethereum#20592)

Conflicts:
  p2p/dial.go
  p2p/server.go
  p2p/server_test.go

* p2p: new dial scheduler

This change replaces the peer-to-peer dial scheduler with a new and
improved implementation. The new code is better than the previous
implementation in two key aspects:

- The time between discovery of a node and dialing that node is
  significantly lower in the new version. The old dialState kept
  a buffer of nodes and launched a task to refill it whenever the buffer
  became empty. This worked well with the discovery interface we used to
  have, but doesn't really work with the new iterator-based discovery
  API.

- Selection of static dial candidates (created by Server.AddPeer or
  through static-nodes.json) performs much better for large amounts of
  static peers. Connections to static nodes are now limited like dynanic
  dials and can no longer overstep MaxPeers or the dial ratio.

* p2p/simulations/adapters: adapt to new NodeDialer interface

* p2p: re-add check for self in checkDial

* p2p: remove peersetCh

* p2p: allow static dials when discovery is disabled

* p2p: add test for dialScheduler.removeStatic

* p2p: remove blank line

* p2p: fix documentation of maxDialPeers

* p2p: change "ok" to "added" in static node log

* p2p: improve dialTask docs

Also increase log level for "Can't resolve node"

* p2p: ensure dial resolver is truly nil without discovery

* p2p: add "looking for peers" log message

* p2p: clean up Server.run comments

* p2p: fix maxDialedConns for maxpeers < dialRatio

Always allocate at least one dial slot unless dialing is disabled using
NoDial or MaxPeers == 0. Most importantly, this fixes MaxPeers == 1 to
dedicate the sole slot to dialing instead of listening.

* p2p: fix RemovePeer to disconnect the peer again

Also make RemovePeer synchronous and add a test.

* p2p: remove "Connection set up" log message

* p2p: clean up connection logging

We previously logged outgoing connection failures up to three times.

- in SetupConn() as "Setting up connection failed addr=..."
- in setupConn() with an error-specific message and "id=... addr=..."
- in dial() as "Dial error task=..."

This commit ensures a single log message is emitted per failure and adds
"id=... addr=... conn=..." everywhere (id= omitted when the ID isn't
known yet).

Also avoid printing a log message when a static dial fails but can't be
resolved because discv4 is disabled. The light client hit this case all
the time, increasing the message count to four lines per failed
connection.

* p2p: document that RemovePeer blocks

* les: add bootstrap nodes as initial discoveries (ethereum#20688)

* celo-blockchain addition: make static dials exempt from the maxDialPeers limit

Static peers are ones we explicitly want to connect to, and so should not be subject to this limit. This is especially important since connections between validators/proxies and other validators/proxies rely on this being so, and because that's how it was prior to the new dial scheduler from upstream.

* celo-blockchain addition: Add tests for two untested RemovePeer() scenarios

Co-authored-by: Felix Lange <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants