
p2p/discover: bootnodes not retried often enough when table becomes empty #20630

Open
tkporter opened this issue Feb 6, 2020 · 7 comments · Fixed by celo-org/celo-blockchain#1194


tkporter commented Feb 6, 2020

System information

Geth version: a fork of 1.9.8-stable
OS & Version: OSX
Commit hash: d62e9b285777c036c108b89fac0c78f7855ba314

Expected behaviour

A node should repeatedly try using a bootnode for discovery if the bootnode is not accessible at node startup

Actual behaviour

When the bootnode is unreachable at node startup, a single discoverTask targeting the bootnode is created and never retried. Looking at #20573, I'm not sure whether this is already a known issue, but it is closely related.

Detailed breakdown:

When starting a discoveryTask, lookupRunning is set to true in this if statement:

if len(s.lookupBuf) < needDynDials && !s.lookupRunning {

So no other discoverTasks can be run at the same time. However, the discoverTask never finishes.
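
For reference, the gating in the old dialstate looks roughly like this (paraphrased from p2p/dial.go on this branch, trimmed; not an exact copy):

if len(s.lookupBuf) < needDynDials && !s.lookupRunning {
	// Not enough candidates and no lookup in flight: start one.
	// lookupRunning is only cleared again when the discoverTask's results
	// come back, so a task that never returns blocks all further
	// discovery tasks from being scheduled.
	s.lookupRunning = true
	newtasks = append(newtasks, &discoverTask{want: needDynDials - len(s.lookupBuf)})
}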

The task begins in Do():

t.results = enode.ReadNodes(srv.discmix, t.want)

which calls ReadNodes():

func ReadNodes(it Iterator, n int) []*Node {
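
For context, ReadNodes roughly does the following (paraphrased from p2p/enode/iter.go): it calls Next at most n times and blocks for as long as each Next call blocks.

func ReadNodes(it Iterator, n int) []*Node {
	seen := make(map[ID]*Node, n)
	for i := 0; i < n && it.Next(); i++ {
		node := it.Node()
		// Deduplicate by node ID, keeping the record with the higher sequence number.
		if prev, ok := seen[node.ID()]; ok && prev.Seq() > node.Seq() {
			continue
		}
		seen[node.ID()] = node
	}
	result := make([]*Node, 0, len(seen))
	for _, node := range seen {
		result = append(result, node)
	}
	return result
}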

ReadNodes uses the FairMix iterator, and calls Next():

func (m *FairMix) Next() bool {

Next will block for a value from source.next here:

case n, ok := <-source.next:

That value is supplied by a separate goroutine, runSource:

func (m *FairMix) runSource(closed chan struct{}, s *mixSource) {
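
runSource itself is roughly this (paraphrased from p2p/enode/iter.go):

func (m *FairMix) runSource(closed chan struct{}, s *mixSource) {
	defer m.wg.Done()
	defer close(s.next)
	// If s.it.Next() never returns, neither s.next nor m.fromAny
	// ever receives a value from this source.
	for s.it.Next() {
		n := s.it.Node()
		select {
		case s.next <- n:
		case m.fromAny <- n:
		case <-closed:
			return
		}
	}
}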

runSource uses another iterator, in this case the lookupIterator. I found that the lookupIterator's Next() ends up in an infinite loop, which causes runSource to hang, so the channel source.next never receives a value.
The infinite loop happens here

if it.lookup == nil {

as each iteration switches between that if statement and the next one.
This causes FairMix's Next() to time out, which is expected. The bad part is that the timeout path calls m.nextFromAny(), which reads from m.fromAny here:

n, ok := <-m.fromAny

but m.fromAny is never written to in runSource:

case m.fromAny <- n:

because of that infinite loop in its iterator's Next() call.
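
To make that alternation concrete, the refill loop in lookupIterator.Next() looks roughly like this (paraphrased from p2p/discover/lookup.go; details may differ in this fork):

func (it *lookupIterator) Next() bool {
	// Consume the next node from the current lookup's buffer.
	if len(it.buffer) > 0 {
		it.buffer = it.buffer[1:]
	}
	// Refill the buffer by advancing lookups.
	for len(it.buffer) == 0 {
		if it.ctx.Err() != nil {
			it.lookup, it.buffer = nil, nil
			return false
		}
		if it.lookup == nil {
			it.lookup = it.nextLookup(it.ctx) // start a new random lookup
			continue
		}
		if !it.lookup.advance() {
			// With an empty table the lookup produces nothing, so this drops it
			// and the next pass immediately starts another one: the loop spins.
			it.lookup = nil
			continue
		}
		it.buffer = it.lookup.replyBuffer
	}
	return true
}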

Steps to reproduce the behaviour

Start a node with --bootnodes pointing to a bootnode that is inaccessible, then make the bootnode accessible; the node will not send any discoverTasks to the bootnode.

@ligi ligi assigned fjl Feb 6, 2020

fjl commented Feb 6, 2020

This is definitely known, but it's also expected behavior. Discovery cannot work without bootstrapping and the lookup iterator will block until the table contains some nodes to do the lookup with.

We have a rewrite of the dialer coming up in #20592, but that PR mostly improves the latency of acting on discovery results. If there are no discovery results, no peer connections can be created.


fjl commented Feb 6, 2020

Maybe I am misunderstanding your issue. AFAIK the lookupIterator will try to access the bootnodes repeatedly if they do not work. Please provide log output of geth --vmodule=p2p/*=5. The logs will show if that's the case.


tkporter commented Feb 6, 2020

Thanks for the response. I see dialTasks to the bootnode are attempted, but not discoverTasks apart from the initial one.

Here are the logs you asked for. I ran

geth --vmodule=p2p/*=5 --bootnodes=enode://f1acc5e79e47b26a3257697ae9640662fd8e01f3edcf590676132bc12a653d04916f269a075063be79ab0079271882688e478c51ded4e9a63aadc72f454b8c37@35.247.9.127:0?discport=30301

with the bootnode unreachable. I started up the bootnode (to be clear, it's cmd/bootnode, not a full node) mid-run (I show where in the logs), and neither my node nor the bootnode indicates that any discovery v4 messages were exchanged. When I re-run geth with the bootnode already up, everything works as expected.

p2plogs.txt

I think I see what you're saying about lookupIterator. Do you mean that

it.lookup = it.nextLookup(it.ctx)
should result in a call to
func (t *UDPv4) newRandomLookup(ctx context.Context) *lookup {
?

I haven't taken the time to look into why that wasn't working, but it doesn't seem to behave the way I'd expect, because I saw an infinite loop between

if it.lookup == nil {
and
if !it.lookup.advance() {

I can try to look more into what the issue is there when I have some time


fjl commented Feb 7, 2020

Thank you for the logs. It looks like the part that's broken is that newRandomLookup doesn't retry re-inserting the bootnodes when the discovery table is empty.

@fjl fjl changed the title discoverTask to an unreachable bootnode never finishes p2p/discover: bootnodes not retried often enough when table becomes empty Feb 7, 2020

oneeman commented Oct 22, 2020

@fjl I was looking at this and thought it should have been fixed by #21396 (though it doesn't reinsert the bootnodes), since that change means the bootnode won't be dropped there if it's the only node in the table (which is what happens in this scenario). But I found that it didn't fix the issue, and after looking into it I saw (from the "Removed dead node" log messages) that the bootnode is being removed from the table as part of doRevalidate(), which runs every 10 seconds:

// No reply received, pick a replacement or delete the node if there aren't
// any replacements.
if r := tab.replace(b, last); r != nil {
	tab.log.Debug("Replaced dead node", "b", bi, "id", last.ID(), "ip", last.IP(), "checks", last.livenessChecks, "r", r.ID(), "rip", r.IP())
} else {
	tab.log.Debug("Removed dead node", "b", bi, "id", last.ID(), "ip", last.IP(), "checks", last.livenessChecks)
}

I figured it would make sense to change this to go along with the change in #21396, and I could submit a PR for it, but I wasn't sure exactly what the fix should be: (a) avoid removing the node if it's the only one in the bucket and there are no replacements, (b) avoid removing it if the bucket is less than half full (as in #21396) and there are no replacements, or (c) something else. So I thought I'd ask what your thoughts are before proceeding. Thanks!
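
For illustration only, option (b) might look roughly like this inside Table.replace() (a hypothetical sketch, not a tested patch; bucketSize/2 mirrors the threshold used in #21396):

if len(b.replacements) == 0 {
	if len(b.entries) < bucketSize/2 {
		// Hypothetical: no replacement available and the bucket is under half
		// full, so keep the dead entry (e.g. a lone bootnode) and let
		// revalidation retry it later instead of emptying the table.
		return nil
	}
	tab.deleteInBucket(b, last)
	return nil
}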


wbt commented Jun 3, 2021

I have been noticing a related bug which might have a similar root cause to this one, but might not.
Setup (done on a Windows 10 development machine):

  1. Start a node to act as a bootnode on a nonstandard local chain and copy the enode address.
  2. With that first node running, in a separate terminal (PowerShell), start another node with the --bootnodes parameter set to the quoted enode address from the prior step.
  3. The second node will connect, sync, etc., and act normally as expected. Let this run for a while.
  4. Hibernate the machine overnight.
  5. Bring the computer out of hibernation.
  6. The second node will no longer have any peers/connection to the original bootnode, and will not be able to discover it.
  7. This will continue indefinitely.
  8. Use CTRL+C to stop the second node, only.
  9. Restart the second node using the exact same command as before.
  10. The second node will again connect to the first and do catch-up resync, then stay connected until the next hibernate.

Expected behavior:
The enode address is retained. If the second node loses access to the bootnode and has no remaining peers, it'll try the bootnode again, just like it did at startup.

Is that the same issue as this, or should it be spun off as a separate one?


wbt commented Feb 22, 2023

Any thoughts on pulling in oneeman's fix, or a response to the question above?

From my perspective (not a geth maintainer), option (b) seems to make the most sense: I'd want the bootnodes specified at startup to remain in the list and continue to be retried even if they are occasionally offline, but we don't want the table filled up completely with dead bootnodes while other live nodes are available on the network (unless --nodiscover is also used, in which case the potential harm from a table full of dead nodes seems like less of an issue).
