swarm: potential issue with dial limiter #1553

Open
whyrusleeping opened this issue Jun 7, 2018 · 8 comments

@whyrusleeping
Contributor

So this isn't necessarily an implementation bug, but rather a design bug. While investigating connectivity issues, I put together a log scraper/coalescer thingy that gathers information about each overall dial attempt and each of its individual dials. The full log is here, and should be pretty interesting for anyone who cares about dials.

I'm making this issue because I see a lot of dials like this one:

{
  "dials": [
    {
      "targetAddr": "/ip4/127.0.0.1/tcp/4001",
      "result": "",
      "error": "dial tcp4 127.0.0.1:4001: connect: connection refused",
      "duration": "196.56µs",
      "start": "2018-06-07T02:56:33.091027367-07:00"
    },
    {
      "targetAddr": "/ip6/::1/tcp/4001",
      "result": "",
      "error": "dial tcp6 [::1]:4001: connect: connection refused",
      "duration": "453.875µs",
      "start": "2018-06-07T02:56:33.090807478-07:00"
    },
    {
      "targetAddr": "/ip4/94.24.55.20/tcp/4001",
      "result": "",
      "duration": "3.447658049s",
      "start": "2018-06-07T02:56:33.090940097-07:00"
    }
  ],
  "success": true,
  "duration": "29.289350417s",
  "target": "QmSz8K4YJYwQYvsDjgsYQTHhyf3rg3BUNkpvGqmxq8sDHb",
  "err": "",
  "start": "2018-06-07T02:56:07.249681968-07:00"
}

Where the overall dial takes about 30 seconds, but the sum of the dials involved takes only ~3.5s.

Then there's also this one:

{
  "dials": [
    {
      "targetAddr": "/ip6/::1/tcp/4001",
      "result": "",
      "error": "dial tcp6 [::1]:4001: connect: connection refused",
      "duration": "577.624µs",
      "start": "2018-06-07T02:55:36.981039644-07:00"
    },
    {
      "targetAddr": "/ip4/127.0.0.1/tcp/4001",
      "result": "",
      "error": "dial tcp4 127.0.0.1:4001: connect: connection refused",
      "duration": "401.141µs",
      "start": "2018-06-07T02:55:36.981775784-07:00"
    }
  ],
  "success": false,
  "duration": "1m0.001597724s",
  "target": "Qmdx7w4sYJirLNCXpKtyfMsE3ZHzNVSnH94Ej4QFphDHxC",
  "err": "<peer.ID Qb3Xmt> --> <peer.ID dx7w4s> dial attempt failed: dial tcp4 127.0.0.1:4001: connect: connection refused",
  "start": "2018-06-07T02:55:36.074255709-07:00"
}

Where the overall dial operation takes a full minute, but only contains two dials, each of which took ~0 time.

@whyrusleeping
Contributor Author

cc @magik6k @Stebalien

@magik6k
Contributor

magik6k commented Jun 7, 2018

Digging through the data, there are a few things I see:

  • We should give LAN address dials a lower timeout (see the sketch after this list)
    • Most dials to these addresses just sit around taking up dial limiter space for 60s
  • Try to sort addresses based on how likely we are to reach them
  • We should lower dialLimiter.perPeerLimit when there are dials to many unique peers
    • What is happening currently is that most FDs are taken by a few peers (so instead of trying to dial 160 peers one address at a time, we dial 20 with 8 concurrent dials each)

@whyrusleeping
Contributor Author

> We should give LAN address dials a lower timeout

Yeah, big +1 here. I honestly think 5s should be plenty.
What we really need is more granular timeouts: one timeout for the raw underlying 'connect' call, and then a slightly larger timeout for the crypto handshake part.
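
A minimal sketch of what that split could look like, assuming a generic upgrade callback standing in for the crypto handshake; the function name and both timeout values are made up for illustration:

```go
package dialer

import (
	"context"
	"net"
	"time"
)

// Hypothetical split: the raw connect gets a tight deadline, the crypto
// handshake gets a fresh, slightly larger one. Values are illustrative.
const (
	connectTimeout   = 5 * time.Second
	handshakeTimeout = 10 * time.Second
)

// dialAndUpgrade bounds the two phases separately. upgrade stands in for
// whatever secures the raw connection.
func dialAndUpgrade(ctx context.Context, addr string,
	upgrade func(context.Context, net.Conn) (net.Conn, error)) (net.Conn, error) {

	connectCtx, cancel := context.WithTimeout(ctx, connectTimeout)
	defer cancel()

	var d net.Dialer
	raw, err := d.DialContext(connectCtx, "tcp", addr)
	if err != nil {
		return nil, err
	}

	// The handshake gets its own deadline rather than whatever is left
	// over from the connect phase.
	hsCtx, cancelHS := context.WithTimeout(ctx, handshakeTimeout)
	defer cancelHS()

	secured, err := upgrade(hsCtx, raw)
	if err != nil {
		raw.Close()
		return nil, err
	}
	return secured, nil
}
```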

> We should lower dialLimiter.perPeerLimit when there are dials to many unique peers

Yeah, this is a good idea. We will have to come up with a good heuristic here, but it should be pretty simple to do.
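
One possible heuristic, purely as a sketch (the function name and numbers are hypothetical, not the real dialLimiter API): divide the FD budget evenly across peers that currently have dials in flight, clamped to a sane range.

```go
package dialer

// perPeerLimit is a hypothetical heuristic, not the actual dialLimiter
// field: split the FD budget evenly across peers that currently have
// dials in flight, clamped to [1, maxPerPeer].
func perPeerLimit(activePeers, fdLimit, maxPerPeer int) int {
	if activePeers == 0 {
		return maxPerPeer
	}
	limit := fdLimit / activePeers
	if limit < 1 {
		limit = 1
	}
	if limit > maxPerPeer {
		limit = maxPerPeer
	}
	return limit
}
```

With an FD budget of 160 and a per-peer cap of 8, 20 active peers would get 8 concurrent dials each, while 160 active peers would get 1 each, which is the trade-off described above.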

@magik6k
Contributor

magik6k commented Jun 7, 2018

So I just realized that there may be another problem:

Example:

  • Start mass-dialing peers (basically any DHT action)
  • Hit a set of peers which will 'clog up' the dial limiter
  • Hit a timeout on all waiting connections, even though they could've been easily reachable

This is another thing fixable by making timeouts more granular/smarter.

@Stebalien
Member

Damn. Yeah, that dial timeout is in the wrong place. It should be per connection.

However, moving the dial timeout may make things worse: a single peer with a ton of addresses could clog the dialer forever.

So, we may need two (max time dialing a single peer, max time per connection).
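
A rough sketch of those two nested bounds using plain contexts (names and values are illustrative, not the swarm's actual implementation):

```go
package dialer

import (
	"context"
	"errors"
	"net"
	"time"
)

// Illustrative bounds: the whole per-peer dial attempt is capped, and each
// individual connection attempt within it is capped more tightly.
const (
	perPeerTimeout = 60 * time.Second
	perConnTimeout = 10 * time.Second
)

// dialPeer tries each address in turn. The outer context bounds the whole
// peer dial, so a peer with a ton of addresses can't clog the dialer
// forever; the inner context bounds each single connection attempt.
func dialPeer(ctx context.Context, addrs []string) (net.Conn, error) {
	peerCtx, cancel := context.WithTimeout(ctx, perPeerTimeout)
	defer cancel()

	var d net.Dialer
	lastErr := errors.New("no addresses to dial")
	for _, a := range addrs {
		connCtx, cancelConn := context.WithTimeout(peerCtx, perConnTimeout)
		conn, err := d.DialContext(connCtx, "tcp", a)
		cancelConn()
		if err == nil {
			return conn, nil
		}
		lastErr = err
		if peerCtx.Err() != nil {
			break // per-peer budget exhausted
		}
	}
	return nil, lastErr
}
```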

@magik6k
Contributor

magik6k commented Jun 7, 2018

> However, moving the dial timeout may make things worse: a single peer with a ton of addresses could clog the dialer forever.

Not really, as we only allow 8 concurrent dials per peer at a time, but yeah, it's still an issue.

@Stebalien
Member

> So, we may need two (max time dialing a single peer, max time per connection).

Status: We now have this.

@raulk
Member

raulk commented Jan 31, 2019

@whyrusleeping It's possible that in your second example there were one or more dials parked by the limiter because the FD limit was exceeded. Not sure where the tracing kicks in for your logging gadget, but if it's when the dial is actually in flight, that would explain why the trace was 1 minute long but only contained two adjacent ~0.5ms spans.

@marten-seemann marten-seemann changed the title potential issue with dial limiter swarm: potential issue with dial limiter May 25, 2022
@marten-seemann marten-seemann transferred this issue from libp2p/go-libp2p-swarm May 25, 2022