Skip to content

Commit

Permalink
connectd: don't start connecting in parallel in peer_conn_closed.
Browse files Browse the repository at this point in the history
The crash below from @zerofeerouting left me confused.  The invalid
value in fmt_wireaddr_internal is a telltale sign of use-after-free.

This backtrace shows us destroying the conn *twice*: what's happening?

Well, tal carefully protects against destroying twice: it's not that
unusual to free something in a destructor which has already been freed.
So this indicates that there are *two* io_conn hanging off one
struct connecting, which isn't supposed to happen!  We deliberately
call try_connect_one_addr() initially, then inside the io_conn destructor.

But due to races in connectd vs lightningd connection state, we added
a fix which allows a connect command to sit around while the peer is
cleaning up (6cc9f37) and get fired
off when it's done.

But what if, in the chaos, we are already connecting again?  Now we'll
end up with *two* connections.

Fortunately, we have a `conn` pointer inside struct connecting, which
(with a bit of additional care) we can ensure is only non-NULL while
we're actually trying to connect.  This lets us check that before
firing off a new connection attempt in peer_conn_closed.

```
lightning_connectd: FATAL SIGNAL 6 (version v0.11.2rc2-2-g8f7e939)
0x5614a4915ae8 send_backtrace
	common/daemon.c:33
0x5614a4915b72 crashdump
	common/daemon.c:46
0x7ffa14fcd72f ???
	???:0
0x7ffa14dc87bb ???
	???:0
0x7ffa14db3534 ???
	???:0
0x5614a491fc71 fmt_wireaddr_internal
	common/wireaddr.c:255
0x5614a491fc7a fmt_wireaddr_internal_
	common/wireaddr.c:257
0x5614a491ea6b type_to_string_
	common/type_to_string.c:32
0x5614a490beaa destroy_io_conn
	connectd/connectd.c:754
0x5614a494a2f1 destroy_conn
	ccan/ccan/io/poll.c:246
0x5614a494a313 destroy_conn_close_fd
	ccan/ccan/io/poll.c:252
0x5614a4953804 notify
	ccan/ccan/tal/tal.c:240
0x5614a49538d6 del_tree
	ccan/ccan/tal/tal.c:402
0x5614a4953928 del_tree
	ccan/ccan/tal/tal.c:412
0x5614a4953e07 tal_free
	ccan/ccan/tal/tal.c:486
0x5614a4908b7a try_connect_one_addr
	connectd/connectd.c:870
0x5614a490bef1 destroy_io_conn
	connectd/connectd.c:759
0x5614a494a2f1 destroy_conn
	ccan/ccan/io/poll.c:246
0x5614a494a313 destroy_conn_close_fd
	ccan/ccan/io/poll.c:252
0x5614a4953804 notify
	ccan/ccan/tal/tal.c:240
0x5614a49538d6 del_tree
	ccan/ccan/tal/tal.c:402
0x5614a4953e07 tal_free
	ccan/ccan/tal/tal.c:486
0x5614a4948f08 io_close
	ccan/ccan/io/io.c:450
0x5614a4948f59 do_plan
	ccan/ccan/io/io.c:401
0x5614a4948fe1 io_ready
	ccan/ccan/io/io.c:417
0x5614a494a8e6 io_loop
	ccan/ccan/io/poll.c:453
0x5614a490c12f main
	connectd/connectd.c:2164
0x7ffa14db509a ???
	???:0
0x5614a4904e99 ???
	???:0
0xffffffffffffffff ???
	???:0
```

Signed-off-by: Rusty Russell <[email protected]>
  • Loading branch information
rustyrussell committed Jun 24, 2022
1 parent 8f7e939 commit c9587ab
Show file tree
Hide file tree
Showing 2 changed files with 8 additions and 4 deletions.
5 changes: 4 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.11.2] - 2022-06-20: Simon's Carefully Chosen Release Name III
## [0.11.2] - 2022-06-24: Simon's Carefully Chosen Release Name III

Regressions since 0.10.2 which could not wait for the 0.12 release,
which especially hurt larger nodes.
Expand All @@ -13,6 +13,7 @@ which especially hurt larger nodes.

- Protocol: treat LND "internal error" as warnings, not force close events (like v0.10) ([#5326])
- connectd: no longer occasional crashes when peers reconnect. ([#5300])
- connectd: another crash fix on trying to reconnect to disconnecting peer. ([#5340])
- topology: Under some circumstances we were considering the limits on the wrong direction for a channel ([#5286])
- routing: Fixed an issue where we would exclude the entire channel if either direction was disabled, or we hadn't seen an update yet. ([#5286])
- connectd: large memory usage with many peers fixed. ([#5312])
Expand All @@ -25,6 +26,8 @@ which especially hurt larger nodes.
[#5326]: https:/ElementsProject/lightning/pull/5326
[#5328]: https:/ElementsProject/lightning/pull/5328
[#5331]: https:/ElementsProject/lightning/pull/5331
[#5340]: https:/ElementsProject/lightning/pull/5340

[0.11.2]: https:/ElementsProject/lightning/releases/tag/v0.11.2

## [0.11.1] - 2022-05-13: Simon's Carefully Chosen Release Name II
Expand Down
7 changes: 4 additions & 3 deletions connectd/connectd.c
Original file line number Diff line number Diff line change
Expand Up @@ -756,6 +756,7 @@ static void destroy_io_conn(struct io_conn *conn, struct connecting *connect)
&connect->addrs[connect->addrnum]),
connect->connstate, errstr));
connect->addrnum++;
connect->conn = NULL;
try_connect_one_addr(connect);
}

Expand Down Expand Up @@ -857,8 +858,7 @@ static void try_connect_one_addr(struct connecting *connect)
struct sockaddr_in6 *sa6;
#endif

/* In case we fail without a connection, make destroy_io_conn happy */
connect->conn = NULL;
assert(!connect->conn);

/* Out of addresses? */
if (connect->addrnum == tal_count(connect->addrs)) {
Expand Down Expand Up @@ -1879,6 +1879,7 @@ static void try_connect_peer(struct daemon *daemon,
connect->seconds_waited = seconds_waited;
connect->addrhint = tal_steal(connect, addrhint);
connect->errors = tal_strdup(connect, "");
connect->conn = NULL;
list_add_tail(&daemon->connecting, &connect->list);
tal_add_destructor(connect, destroy_connecting);

Expand Down Expand Up @@ -1931,7 +1932,7 @@ void peer_conn_closed(struct peer *peer)
tal_free(peer);

/* If we wanted to connect to it, but found it was exiting, try again */
if (connect)
if (connect && !connect->conn)
try_connect_one_addr(connect);
}

Expand Down

0 comments on commit c9587ab

Please sign in to comment.