Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bluetooth: mesh: update heartbeat use of k_work API #33148

Merged
merged 2 commits into from
Mar 10, 2021

Conversation

pabigot
Copy link
Collaborator

@pabigot pabigot commented Mar 8, 2021

Replace the legacy delayed work API with the new delayable work API. Avoid race conditions in canceling work items by using a zero period as a flag value to ensure that the work publish handler is a no-op of the publish operation is disabled, and allow subscription expiration to be done by the handler.

Copy link
Collaborator Author

@pabigot pabigot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: in the absence of a meaningful sample I can't test this; please verify that it works.

@@ -149,17 +152,18 @@ static void hb_publish(struct k_work *work)

BT_DBG("hb_pub.count: %u", pub.count);

/* Fast exit if disabled or expired */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a subtle behavior change here: for publish where the count had expired publish could detect a lost subnet and change state. Assuming that is not required behavior I've moved the fast-exit for expired/disabled before the subnet check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works. Looks like the publish disabled decision point isn't handled consistently, but I'll address this in a separate PR after this goes in.

@@ -149,17 +152,18 @@ static void hb_publish(struct k_work *work)

BT_DBG("hb_pub.count: %u", pub.count);

/* Fast exit if disabled or expired */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works. Looks like the publish disabled decision point isn't handled consistently, but I'll address this in a separate PR after this goes in.

@@ -216,7 +220,7 @@ static void pub_disable(void)
pub.ttl = 0U;
pub.period = 0U;

k_delayed_work_cancel(&pub_timer);
k_work_reschedule(&pub_timer, K_NO_WAIT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you refrained from using k_work_cancel_delayable here just to stay consistent with the change on line 259, or is there some reason why cancel shouldn't be used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found that most uses of k_delayed_work_cancel() are wrong because they don't check for failure, and if the work item is still running they aren't prepared for what might happen. This is largely due to incomplete handling of thread safety, as may be the case in this implementation (different threads can be accessing the pub fields simultaneously). See #33104, which is a work in progress and doesn't adequately address the complexities of this yet.

So where "cancel" can be replaced with "set a flag and allow the work item to do its job and then stop" I prefer that.

In this case it probably doesn't add any real value to try to avoid the cancel. I've reverted this for now, so it's more clear that there's something not quite right here that should be revisited when we have a better understanding of the nuances.

if (!k_delayed_work_cancel(&sub_timer)) {
notify_sub_end();
}
sub.period = 0U;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to follow this line up with a PTS run, I think. I couldn't find any places in the test spec where the old behavior was required, but it could be a quirk in the PTS implementation.

Won't let this block this PR though.
cc @michal-narajowski

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran HBP and HBS tests with this PR and they are passing so this looks good to me.

Replace the legacy delayed work API with the new delayable work API.
Use a zero period as a flag value to ensure that the work handler is a
no-op of the publish operation is disabled.

Signed-off-by: Peter Bigot <[email protected]>
Replace the legacy delayed work API with the new delayable work API.
Avoid cancelling work and manually notifying when the subscription is
disabled; instead allow the work item to do this.

Signed-off-by: Peter Bigot <[email protected]>
@jhedberg jhedberg merged commit 5506727 into zephyrproject-rtos:master Mar 10, 2021
@pabigot pabigot deleted the nordic/20210308c branch March 15, 2021 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants