
p2p/discover: bootnodes not retried often enough when table becomes empty #20630

Open
tkporter opened this issue Feb 6, 2020 · 7 comments · Fixed by celo-org/celo-blockchain#1194


tkporter commented Feb 6, 2020

System information

Geth version: a fork of 1.9.8-stable
OS & Version: OSX
Commit hash: d62e9b285777c036c108b89fac0c78f7855ba314

Expected behaviour

A node should repeatedly try using a bootnode for discovery if the bootnode is not accessible at node startup

Actual behaviour

When the bootnode is unreachable at node startup, a single discoverTask targeting the bootnode is created and never retried. Looking at #20573, I'm not sure whether this is already a known issue, but it is closely related.

Detailed breakdown:

When starting a discoveryTask, lookupRunning is set to true in this if statement:

if len(s.lookupBuf) < needDynDials && !s.lookupRunning {

So no other discoverTasks can be run at the same time. However, the discoverTask never finishes.
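
For reference, the gating in the old dialstate looks roughly like this (paraphrased from p2p/dial.go on this branch, trimmed; not an exact copy):

if len(s.lookupBuf) < needDynDials && !s.lookupRunning {
	// Not enough candidates and no lookup in flight: start one.
	// lookupRunning is only cleared again when the discoverTask's results
	// come back, so a task that never returns blocks all further
	// discovery tasks from being scheduled.
	s.lookupRunning = true
	newtasks = append(newtasks, &discoverTask{want: needDynDials - len(s.lookupBuf)})
}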

The task begins in Do():

t.results = enode.ReadNodes(srv.discmix, t.want)

which calls ReadNodes():

func ReadNodes(it Iterator, n int) []*Node {
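
For context, ReadNodes roughly does the following (paraphrased from p2p/enode/iter.go): it calls Next at most n times and blocks for as long as each Next call blocks.

func ReadNodes(it Iterator, n int) []*Node {
	seen := make(map[ID]*Node, n)
	for i := 0; i < n && it.Next(); i++ {
		node := it.Node()
		// Deduplicate by node ID, keeping the record with the higher sequence number.
		if prev, ok := seen[node.ID()]; ok && prev.Seq() > node.Seq() {
			continue
		}
		seen[node.ID()] = node
	}
	result := make([]*Node, 0, len(seen))
	for _, node := range seen {
		result = append(result, node)
	}
	return result
}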

ReadNodes uses the FairMix iterator, and calls Next():

func (m *FairMix) Next() bool {

Next will block for a value from source.next here:

case n, ok := <-source.next:

That value is supplied by a separate goroutine, runSource:

func (m *FairMix) runSource(closed chan struct{}, s *mixSource) {
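
runSource itself is roughly this (paraphrased from p2p/enode/iter.go):

func (m *FairMix) runSource(closed chan struct{}, s *mixSource) {
	defer m.wg.Done()
	defer close(s.next)
	// If s.it.Next() never returns, neither s.next nor m.fromAny
	// ever receives a value from this source.
	for s.it.Next() {
		n := s.it.Node()
		select {
		case s.next <- n:
		case m.fromAny <- n:
		case <-closed:
			return
		}
	}
}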

runSource uses another iterator, in this case the lookupIterator. I found that the lookupIterator's Next() ends up in an infinite loop, which causes runSource to hang, so the channel source.next never receives a value.
The infinite loop happens here

if it.lookup == nil {

as each iteration switches between that if statement and the next one.
This causes FairMix's Next() to time out, which is expected. The bad part is that the timeout path calls m.nextFromAny(), which reads from m.fromAny here:

n, ok := <-m.fromAny

but m.fromAny is never written to in runSource:

case m.fromAny <- n:

because of that infinite loop in its iterator's Next() call.
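
To make that alternation concrete, the refill loop in lookupIterator.Next() looks roughly like this (paraphrased from p2p/discover/lookup.go; details may differ in this fork):

func (it *lookupIterator) Next() bool {
	// Consume the next node from the current lookup's buffer.
	if len(it.buffer) > 0 {
		it.buffer = it.buffer[1:]
	}
	// Refill the buffer by advancing lookups.
	for len(it.buffer) == 0 {
		if it.ctx.Err() != nil {
			it.lookup, it.buffer = nil, nil
			return false
		}
		if it.lookup == nil {
			it.lookup = it.nextLookup(it.ctx) // start a new random lookup
			continue
		}
		if !it.lookup.advance() {
			// With an empty table the lookup produces nothing, so this drops it
			// and the next pass immediately starts another one: the loop spins.
			it.lookup = nil
			continue
		}
		it.buffer = it.lookup.replyBuffer
	}
	return true
}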

Steps to reproduce the behaviour

Start a node with --bootnodes pointing to a bootnode that is inaccessible, then make the bootnode accessible; the node will not send any discoverTasks to the bootnode.

@ligi ligi assigned fjl Feb 6, 2020

fjl commented Feb 6, 2020

This is definitely known, but it's also expected behavior. Discovery cannot work without bootstrapping and the lookup iterator will block until the table contains some nodes to do the lookup with.

We have a rewrite of the dialer coming up in #20592, but that PR mostly improves the latency of acting on discovery results. If there are no discovery results, no peer connections can be created.


fjl commented Feb 6, 2020

Maybe I am misunderstanding your issue. AFAIK the lookupIterator will try to access the bootnodes repeatedly if they do not work. Please provide log output of geth --vmodule=p2p/*=5. The logs will show if that's the case.


tkporter commented Feb 6, 2020

Thanks for the response. I see dialTasks to the bootnode are attempted, but not discoverTasks apart from the initial one.

Here are the logs you asked for. I ran

geth --vmodule=p2p/*=5 --bootnodes=enode://f1acc5e79e47b26a3257697ae9640662fd8e01f3edcf590676132bc12a653d04916f269a075063be79ab0079271882688e478c51ded4e9a63aadc72f454b8c37@35.247.9.127:0?discport=30301

with the bootnode unreachable. I started up the bootnode (to be clear, it's cmd/bootnode, not a full node) mid-run (I show where in the logs), and neither my node nor the bootnode indicates that any discovery v4 messages were exchanged. When I re-run geth with the bootnode already up, everything works as expected.

p2plogs.txt

I think I see what you're saying about lookupIterator. Do you mean that

it.lookup = it.nextLookup(it.ctx)
should result in a call to
func (t *UDPv4) newRandomLookup(ctx context.Context) *lookup {
?

I haven't taken the time to look into why that wasn't working, but it doesn't seem to behave the way I'd expect, because I saw an infinite loop between

if it.lookup == nil {
and
if !it.lookup.advance() {

I can try to look more into what the issue is there when I have some time


fjl commented Feb 7, 2020

Thank you for the logs. It looks like the part that's broken is that newRandomLookup doesn't retry re-inserting the bootnodes when the discovery table is empty.

@fjl fjl changed the title discoverTask to an unreachable bootnode never finishes p2p/discover: bootnodes not retried often enough when table becomes empty Feb 7, 2020

oneeman commented Oct 22, 2020

@fjl I was looking at this and thought it should have been fixed by #21396 (though it doesn't reinsert the bootnodes), since that change means the bootnode won't be dropped there if it's the only node in the table (which is what happens in this scenario). But I found that it didn't fix the issue, and after looking into it I saw (from the "Removed dead node" log messages) that the bootnode is being removed from the table as part of doRevalidate(), which runs every 10 seconds:

// No reply received, pick a replacement or delete the node if there aren't
// any replacements.
if r := tab.replace(b, last); r != nil {
	tab.log.Debug("Replaced dead node", "b", bi, "id", last.ID(), "ip", last.IP(), "checks", last.livenessChecks, "r", r.ID(), "rip", r.IP())
} else {
	tab.log.Debug("Removed dead node", "b", bi, "id", last.ID(), "ip", last.IP(), "checks", last.livenessChecks)
}

I figured it would make sense to change this to go along with the change in #21396, and I could submit a PR for it, but I wasn't sure exactly what the fix should be: (a) avoid removing the node if it's the only one in the bucket and there are no replacements, (b) avoid removing it if the bucket is less than half full (as in #21396) and there are no replacements, or (c) something else. So I thought I'd ask what your thoughts are before proceeding. Thanks!
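
For illustration only, option (b) might look roughly like this inside Table.replace() (a hypothetical sketch, not a tested patch; bucketSize/2 mirrors the threshold used in #21396):

if len(b.replacements) == 0 {
	if len(b.entries) < bucketSize/2 {
		// Hypothetical: no replacement available and the bucket is under half
		// full, so keep the dead entry (e.g. a lone bootnode) and let
		// revalidation retry it later instead of emptying the table.
		return nil
	}
	tab.deleteInBucket(b, last)
	return nil
}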


wbt commented Jun 3, 2021

I have been noticing a related bug which might have a similar root cause to this one, but might not.
Setup (done on a Windows 10 development machine):

  1. Start a node to act as a bootnode on a nonstandard local chain and copy the enode address.
  2. With that first node running, in a separate terminal (PowerShell), start another node with the --bootnodes parameter set to the quoted enode address from the prior step.
  3. The second node will connect, sync, etc., and act normally as expected. Let this run for a while.
  4. Hibernate the machine overnight.
  5. Bring the computer out of hibernation.
  6. The second node will no longer have any peers/connection to the original bootnode, and will not be able to discover it.
  7. This will continue indefinitely.
  8. Use CTRL+C to stop the second node, only.
  9. Restart the second node using the exact same command as before.
  10. The second node will again connect to the first and do catch-up resync, then stay connected until the next hibernate.

Expected behavior:
The enode address is retained. If the second node loses access to the bootnode and has no remaining peers, it'll try the bootnode again, just like it did at startup.

Is that the same issue as this, or should it be spun off as a separate one?


wbt commented Feb 22, 2023

Any thoughts on pulling in oneeman's fix, or a response to the question above?

From my perspective (not a geth maintainer), option (b) seems to make the most sense: I'd want the bootnodes specified at startup to remain in the list and continue to be retried even if they are occasionally offline, but we don't want the table filled up completely with dead bootnodes while other live nodes are available on the network (unless --nodiscover is also used, in which case the potential harm from a table full of dead nodes seems like less of an issue).
