Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential deadlock between RequestRoute and LightningChannel commitment state machine #9133

Open
bjohnson5 opened this issue Sep 23, 2024 Discussed in #9060 · 1 comment
Assignees
Labels
bug Unintended code behaviour kvdb
Milestone

Comments

@bjohnson5
Copy link

Discussed in #9060

Originally posted by bjohnson5 September 3, 2024
I believe I have discovered a potential deadlock situation, but I am relatively new to LND and wanted to discuss it before opening an issue, to be sure that I am not missing something. This is when using the bbolt backend database.

In lnwallet/channel.go the LightningChannel struct defines several methods that the comments explain as the "state machine which corresponds to the current commitment protocol wire spec". These methods are: SignNextCommitment, ReceiveNewCommitment, RevokeCurrentCommitment, and ReceiveRevocation. Each of these will first lock the LightningChannel: lc.lock() and then they will typically attempt to update the channel db.

When updating the channel db, sometimes the database must be re-sized and re-mapped to memory using the mmap function in bbolt's db.go file. This function first attempts to lock the mmaplock mutex.

This is all fine except that if one of the state machine functions is called while the node is trying to find a route a deadlock could occur. The RequestRoute function in payment_session.go will get a routing graph from the db and this will acquire the mmaplock on the db (for good reason, it needs to be sure the db is not re-mapped while it is using it to find a route). It will eventually call functions of the LightningChannel struct in order to find bandwidth, balances, etc... It is possible that these functions are locked by one of the state machine methods and that state machine method could be stuck waiting on the mmaplock.

For example:
Thread1: RequestRoute -> NewGraphSession() -> mmaplock.lock()
----------------------------------> p.pathFinder -> availableChanBandwidth -> attempts to call LC functions, blocks waiting on lc.lock()

Thread2: ReceiveRevocation -> lc.lock()
-----------------------------------------> AdvanceCommitChainTail -> attempts to update db, blocks waiting on mmaplock.lock()

If anyone has experience in this area, please let me know if this is all correct and if I should open an issue. Thanks.

@ziggie1984
Copy link
Collaborator

So analysed this issue, and I think there is absolutely no reason we keep an open transaction as part of the session implementation. Its only used there:

func (g *session) ForEachNodeChannel(nodePub route.Vertex,
cb func(channel *channeldb.DirectedChannel) error) error {
return g.graph.ForEachNodeDirectedChannel(g.tx, nodePub, cb)
}

and I think there is no problem in reading the db once at this point fetching all the channels for the local node and closing the read transaction.

Should be fairly easy to fix.

@ziggie1984 ziggie1984 self-assigned this Sep 23, 2024
@ziggie1984 ziggie1984 added this to the v0.19.0 milestone Sep 23, 2024
@ziggie1984 ziggie1984 added bug Unintended code behaviour kvdb labels Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unintended code behaviour kvdb
Projects
Status: Todo
Development

No branches or pull requests

2 participants