Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: fix HCI command / data deadlock #52536

Merged
merged 4 commits into from
Dec 15, 2022

Conversation

jori-nordic
Copy link
Collaborator

See commits for more info.
TLDR: makes the ACL fragmentation asynchronous, freeing the TX thread for sending commands when the ACL buffers are full.

Fixes #25917

@jori-nordic jori-nordic added area: Bluetooth Host bug The issue is a bug, or the PR is fixing a bug labels Nov 24, 2022
/* Get next ACL packet for connection. The buffer will only get dequeued
* if there is a free controller buffer to put it in.
*/
buf = k_fifo_peek_head(&conn->tx_queue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this is that you're skipping the fragment handling that net_buf_get() takes care of. We don't (AFAIK) currently take advantage of the net_buf fragment feature in the Bluetooth subsystem, but at least we've used the proper net_buf APIs until now that should keep the code working if we did use fragments.

At the very least, in addition to asserting that buf != NULL, you should also BT_ASSERT((buf->flags & NET_BUF_FRAGS) == 0);. Though even that's pretty unclean, since these are really net_buf-internal features.

A cleaner way would be to introduce a new net_buf_peek() API that does the same as net_buf_get() but also fetches any pending fragments.

Yet another, perhaps more long term solution, would be to finally move the net_buf fragments list into it's own independent list that's not mangled together with the linked list pointer as part of the same union (which is the reason for the NET_BUF_FRAGS flag and why fragments are placed as independent objects into the FIFO). Currently the net_buf definition starts off like this:

struct net_buf {
	union {
		/** Allow placing the buffer into sys_slist_t */
		sys_snode_t node;

		/** Fragments associated with this buffer. */
		struct net_buf *frags;
	};

The penalty of moving frags out from the union would be that the struct size increases by the size of a pointer, but the benefit would be that we don't have this hassle of serializing & deserializing fragments whenever buffers pass through LIFO or FIFO objects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least, in addition to asserting that buf != NULL, you should also BT_ASSERT((buf->flags & NET_BUF_FRAGS) == 0);. Though even that's pretty unclean, since these are really net_buf-internal features.

Oh, and in addition to the assert, you need to explicitly set buf->frags = NULL since the k_fifo implementation leaves the linked list pointer untouched when removing from the list.

Also see #38830 (review) and related discussion in the PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that really a problem since I'm using the _get before passing it to bt_send? That takes care of the fragments, and the peek is a non-destructive operation afaik.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least you need to clear the buf->frags pointer since k_fifo_peek_head() doesn't do that for you. You have the EIO branch below where e.g. net_buf_unref() is called, and net_buf_unref() will also unref the frags pointer if it is non-NULL. Same goes for any other net_buf API calls you might do on the buffer before calling net_buf_get()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, for correctness, it would be better to get a proper reference to the buffer when peeking, i.e. call net_buf_ref() on the pointer. Then, when you finally call net_buf_ger() the code would look something like the following (i.e.no need for the dequeued variable):

net_buf_unref(buf);
buf = net_buf_get(...);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and in addition to the assert, you need to explicitly set buf->frags = NULL since the k_fifo implementation leaves the linked list pointer untouched when removing from the list.

I just tried and it didn't work. I think it's because buf->frags is in a union with the node ptr.
So if I overwrite it with NULL while it is still on the queue, and it ends up that I don't pull it from the fifo (e.g. the sem_take fails), the fifo list is effectively broken.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, good point. And this is another reason why using peak() together with the union is bad - you end up with the buffer in an inconsistent state until you actually remove it from from the FIFO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short term solution would then probably be to document this clearly and try to make the code flow path from peek() to get() as short and obvious as possible.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep agreed. I will place a comment above the peek saying this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I reverted the assert since #52760 got merged.

Comment on lines 625 to 629
/* De-queue the buffer now that we know we can send it.
* Only applies if the buffer to be sent is the original buffer,
* and not one of its fragments.
*/
struct net_buf *dequeued = net_buf_get(&conn->tx_queue, K_NO_WAIT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code comment should explicitly mention that the buffer was fetched from the FIFO using a "peek" operation.

#endif /* CONFIG_BT_CONN */
}

static int _send_frag(struct bt_conn *conn, struct net_buf *buf, uint8_t flags)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be static int do_send_frag()

@jori-nordic
Copy link
Collaborator Author

@jhedberg could you review again?

BT_ASSERT(buf);
if (!send_buf(conn, buf)) {

buf = net_buf_ref(buf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a code comment above this, something along the lines of "Since we used peek(), the queue still owns the reference to the buffer, so we need to take an explicit additional reference here.

@carlescufi
Copy link
Member

@runegrunnet could you please take a look?

jhedberg
jhedberg previously approved these changes Dec 14, 2022
subsys/bluetooth/host/conn.c Outdated Show resolved Hide resolved
subsys/bluetooth/host/conn.c Outdated Show resolved Hide resolved
subsys/bluetooth/host/conn.c Show resolved Hide resolved
subsys/bluetooth/host/conn.c Outdated Show resolved Hide resolved
subsys/bluetooth/host/conn.c Outdated Show resolved Hide resolved
subsys/bluetooth/host/conn.c Outdated Show resolved Hide resolved
tests/bluetooth/bsim_bt/bsim_test_l2cap_stress/src/main.c Outdated Show resolved Hide resolved
tests/bluetooth/bsim_bt/bsim_test_l2cap_stress/src/main.c Outdated Show resolved Hide resolved
tests/bluetooth/bsim_bt/bsim_test_l2cap_stress/src/main.c Outdated Show resolved Hide resolved
@jori-nordic
Copy link
Collaborator Author

#37577 was very useful for reproducing the deadlock in simulation, so thanks a lot @prathje ! You might want to review this as well, or try it on hardware.

Make the ACL fragmentation asynchronous, freeing the TX thread for sending
commands when the ACL buffers are full.

Signed-off-by: Jonathan Rico <[email protected]>
When there are no buffers, it doesn't make sense to repeatedly try to send
the host TX queue.

Signed-off-by: Jonathan Rico <[email protected]>
@prathje
Copy link
Contributor

prathje commented Dec 14, 2022

#37577 was very useful for reproducing the deadlock in simulation, so thanks a lot @prathje ! You might want to review this as well, or try it on hardware.

Nice, glad to hear! Sadly, I am currently not working with BLE and unable to review this in detail. Thanks for mentioning me though, I am pleased to see the progress :)

Only copy the data from the 'parent' buf into the segment buf if we
successfully acquired the HCI semaphore (ie there are available controller
buffers).

This way we avoid pulling and pushing back the data in the case where there
aren't any controller buffers available.

Signed-off-by: Jonathan Rico <[email protected]>
Do these things:
- use the proper macros for reserving the SDU header
- make every L2CAP PDU fragment into 3 ACL packets
- set tx data and verify rx matches the pattern
- measure segment pool usage

Signed-off-by: Jonathan Rico <[email protected]>
@@ -372,6 +386,11 @@ static void test_central_main(void)
LOG_DBG("*L2CAP STRESS Central started*");
int err;

/* Prepare tx_data */
for (size_t i = 0; i < sizeof(tx_data); i++) {
tx_data[i] = (uint8_t)i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of the variable being assigned to (tx_data[i]) is already uint8_t, so truncation happens either way, ie the explicit typecast is redundant.

Copy link
Collaborator Author

@jori-nordic jori-nordic Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least with the cast it's visible that it's on purpose and not a mistake

@fabiobaltieri fabiobaltieri merged commit 8bc0946 into zephyrproject-rtos:main Dec 15, 2022
@alwa-nordic alwa-nordic added the backport v2.7-branch Request backport to the v2.7-branch label May 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth Host area: Bluetooth backport v2.7-branch Request backport to the v2.7-branch bug The issue is a bug, or the PR is fixing a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bluetooth: Deadlock with TX of ACL data and HCI commands (command blocked by data)
8 participants