
Updating Zephyr to fix Work Queue Problems #33104

Closed · 18 tasks done
pabigot opened this issue Mar 7, 2021 · 10 comments
Labels: area: Kernel, Meta

Comments

@pabigot (Collaborator) commented Mar 7, 2021

Issue #27356 identified a host of race conditions in the Zephyr work queue implementation through release 2.5, and in the way that the API has been used. #28579 (comment) provides a summary of the problems in the original implementation.

The replacement implementation in #29618, which will ship in 2.6, eliminates the gaps that previously made it impossible to precisely determine the state of a work item at the point an operation on it is performed. A summary of work item states is presented in the documentation.

However: fixing the implementation does not eliminate the problems that result from the ways the API is being used.

With the new implementation it is still necessary to take into account:

  • TOCTOU races
  • The fact that operations like cancel can fail to complete immediately

This issue serves as a short description of the remaining problems and how to address them, and as a place to hold pointers to PRs where those problems are addressed:

Conversion Status

The following areas seem to have API that needs to be converted; a rote conversion can be found for each in a commit in #33924, which also contains a Coccinelle script that was used to do the conversion. Areas must be assigned to maintainers, who should apply the changes from the corresponding commit in #33924, review and adapt them, and submit them as a standalone PR to help verify that there are no dependencies between changes that would break bisectability. Maintainers should also review the best practices that follow, as well as the comments and commit messages from the manual conversion, to mitigate any inherent bugs in the existing use of work items.

Anti-Patterns

Checking whether work is pending before deciding whether to submit/schedule

An instantaneous check of work item state is exactly that: a snapshot. Application and subsystem logic that interprets and acts on that snapshot can be incorrect if the inspecting thread is preempted, an interrupt occurs, or another core is simultaneously operating on work items.

In the original implementation k_work_pending could return false even though the work item was still in use. Some in-tree code uses the false return as an indication the work item is available, potentially resetting the structure content while it's being used by another thread (see 4817a41 from #32890).

While this specific failure mode is no longer possible ("not pending" does mean the work item is, at that instant, not in use), the other direction remains: that a work item is still in use does not mean that it's unnecessary to submit it again so that newly available work will be processed. The work item may have all but finished, and without a second submit there will be a potentially unbounded delay in processing new work.

Presence of any of these in a conditional operation before invoking a work queue operation should be reviewed:

  • Any use of k_work_pending(), k_work_is_pending(), or k_work_delayable_is_pending(), except:
    • in diagnostic messages; or
    • when used to confirm that the work item is not busy (so its state is available for reconfiguration);
  • The comparison of k_work_delayable_remaining_get() to zero;
  • Interpretation of specific states as indicated by k_work_busy_get() or k_work_delayable_busy_get().

Please note that merely changing the legacy API in these uses to the new API does not fix the problem. As an example, although 7b8bce8 was merged in #32890, the openthread code may still misbehave, especially in SMP systems or when preemption is involved. See the description of that commit for details.
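
As a sketch of the difference (hypothetical names, not taken from in-tree code):

/* Anti-pattern: act on a snapshot of the state. */
if (!k_work_is_pending(&ctx->work)) {
    /* RACY: the state can change between the check and the submit; on SMP
     * the handler may be finishing right now and never see the new data.
     */
    k_work_submit(&ctx->work);
}

/* Preferred: submit unconditionally.  With the new implementation this is
 * a no-op if the item is already queued, and queues it to run again if the
 * handler is currently executing.
 */
k_work_submit(&ctx->work);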

Unchecked Return Values

The vast majority of uses of k_delayed_work_cancel() in Zephyr currently assume that this operation never fails: that after requesting a cancel the work item can safely be re-scheduled or reconfigured, or that the work item is guaranteed not to be invoked.

That assumption was never valid, and it is even less valid now that support for synchronizing with work items has been added, because a work item remains pending for a small window after its handler function has returned.
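
A sketch of checking the result (hypothetical names; the return value of k_work_cancel_delayable() is the busy state remaining after the cancel attempt, so zero means the item is idle):

int busy = k_work_cancel_delayable(&ctx->dwork);

if (busy != 0) {
    /* The handler may still run (the item is queued, running, or still
     * being cancelled).  Either block until cancellation completes (never
     * from an ISR) ...
     */
    struct k_work_sync sync;

    (void)k_work_cancel_delayable_sync(&ctx->dwork, &sync);

    /* ... or, in a non-blocking context, set state the handler checks so
     * that it exits without effect, and document the race at the call site
     * (see below).
     */
}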

Best Practices

In most cases avoiding the race conditions requires code and data changes. Below are some practices to consider when refactoring.

Don't over optimize

The k_work_*() API is designed to be safe when invoked from multiple threads and interrupts. Attempts to externally inspect a work item's state and make decisions based on the result are likely to create new problems.

So when new work comes in, just submit it. Don't attempt to "optimize" by checking whether the work item is already submitted, whether by inspecting snapshot state with k_work_is_pending() or k_work_busy_get(), or by looking for a non-zero delay from k_work_delayable_remaining_get(). Those checks are fragile: a "busy" indication can be obsolete by the time the check returns, and a "not-busy" indication can also be wrong if work is submitted from multiple contexts.

Using the work handler as the standard cleanup path also makes the code more clear: rather than having to deal with cancellation and cleanup at points where items are submitted, you may be able to have everything done in one place: the work handler itself.

A rare case where you could safely use k_work_is_pending() is as a check to avoid invoking k_work_flush() or k_work_cancel_sync(), if you are certain that nothing else might submit the work while you're checking. In some cases it can be more robust to just issue the command and have the state checked in the handler, especially if you need locked access to state anyway.
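
As an illustration (a sketch with hypothetical names, using a k_fifo so the data handoff is itself thread-safe):

struct my_ctx {
    struct k_fifo pending;   /* queued items reserve their first word for the FIFO */
    struct k_work work;
};

static void my_work_handler(struct k_work *work)
{
    struct my_ctx *ctx = CONTAINER_OF(work, struct my_ctx, work);
    void *item;

    /* Drain whatever is available and exit quickly when nothing is left. */
    while ((item = k_fifo_get(&ctx->pending, K_NO_WAIT)) != NULL) {
        process_item(item);             /* hypothetical */
    }
}

void my_rx(struct my_ctx *ctx, void *item)
{
    k_fifo_put(&ctx->pending, item);

    /* No pending/busy check: just submit.  The handler decides whether
     * there is anything to do.
     */
    k_work_submit(&ctx->work);
}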

Document Uses of Racy Idioms

There may be cases where temporary or even permanent code still invokes operations like k_work_cancel_delayable() without being able to check or respond to an indication that the cancellation failed. It is good practice to document at the point of call what happens in that case, e.g.:

/* If this fails, the timer will check pub->addr and
* exit without transmitting.
*/
(void)k_work_cancel_delayable(&model->pub->timer);

Protecting Associated Data

Sometimes the data a work item must process is naturally thread-safe, for example when it's put into a k_queue by some thread and processed in the work thread. Where it is not, external synchronization is required to ensure neither the work thread nor other threads are inspecting/manipulating state simultaneously with another thread or interrupt.

Spin locks (k_spinlock_t) or thread-aware locks (k_sem, k_mutex, ...) may be used to ensure data races don't occur. If work is added in an ISR a spin-lock may be required.

If the selected lock mechanism can sleep, then letting the work handler sleep on it will starve other items on the same work queue, some of which may need to run in order for the lock to be released. Work handlers should instead try to take the lock using its no-wait path. For example:

if (k_mutex_lock(&parent->lock, K_NO_WAIT) != 0) {
    /* NB: Submit will fail if the work item is being cancelled. */
    (void)k_work_submit(work);
    return;
}

If you need to use a lock this way be sure that your code is robust when resubmission fails because the work item had been cancelled. Usually to make this robust some other thread must be waiting for the work item to complete (e.g. via k_work_cancel_sync()) or must periodically inspect the state to confirm that it's no longer pending (as check_query_active does in https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/net/lib/dns/resolve.c).

Note that work items (in isolation) are self-locking, so you don't need to hold an external lock just to re-submit or (re)schedule them. Even if you use external state protected by such a lock to prevent further resubmission, it's safe to do the resubmit as long as you're sure that eventually the item will take its lock and check that state to determine it should do nothing.
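
For example, a sketch (hypothetical names) where an ISR and the handler share state under a spin lock, while submission itself needs no external lock:

struct shared {
    struct k_spinlock lock;
    uint32_t events;                    /* written by the ISR */
    struct k_work work;
};

void my_isr(struct shared *sh, uint32_t ev)
{
    k_spinlock_key_t key = k_spin_lock(&sh->lock);

    sh->events |= ev;
    k_spin_unlock(&sh->lock, key);

    /* Work items are self-locking: no external lock is needed to submit. */
    k_work_submit(&sh->work);
}

static void work_handler(struct k_work *work)
{
    struct shared *sh = CONTAINER_OF(work, struct shared, work);
    k_spinlock_key_t key = k_spin_lock(&sh->lock);
    uint32_t events = sh->events;

    sh->events = 0;
    k_spin_unlock(&sh->lock, key);

    handle_events(events);              /* hypothetical; process outside the lock */
}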

Use the Right Scheduling Operation

For delayed work there are two common use cases, depending on whether a deadline should be extended if a new event occurs. An example is collecting data that comes in asynchronously, e.g. characters from a UART associated with a keyboard. The desired behavior might be:

  1. When data arrives keep collecting until a specified delay since the first unprocessed data was received. For this, use k_work_schedule().
  2. When data arrives keep collecting until a specified interval has elapsed since the last time unprocessed data was received. For this, use k_work_reschedule().

In neither of these cases should the code be looking at k_work_delayable_remaining_get() to decide what to do, in part because that's racy and you can end up failing to submit when there's work pending. More information is in the documentation.
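
Sketched for the keyboard example (the names and the 10 ms delay are illustrative):

/* Case 1: flush a fixed delay after the FIRST unprocessed byte.
 * k_work_schedule() leaves an already-scheduled item untouched, so the
 * original deadline is preserved.
 */
void on_rx_case1(struct kbd *kbd)
{
    (void)k_work_schedule(&kbd->flush_work, K_MSEC(10));
}

/* Case 2: flush only once input has been idle, i.e. a fixed delay after
 * the LAST byte.  k_work_reschedule() restarts the countdown each time,
 * extending the deadline.
 */
void on_rx_case2(struct kbd *kbd)
{
    (void)k_work_reschedule(&kbd->flush_work, K_MSEC(10));
}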

Checking that a Work Item is Completed

If you need to be sure that potentially submitted work is not executed then set a flag in state that's exposed to the work handler, and have the handler detect it and do the cleanup. Then if necessary synchronize with the work item, first resubmitting it if necessary so that the handler can perform state cleanup.
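
A sketch of that pattern (hypothetical names), using an atomic flag and a blocking synchronization from thread context:

static void my_handler(struct k_work *work)
{
    struct my_obj *obj = CONTAINER_OF(work, struct my_obj, work);

    if (atomic_get(&obj->stopping)) {
        do_cleanup(obj);                /* hypothetical */
        return;
    }
    /* ... normal processing ... */
}

void my_stop(struct my_obj *obj)
{
    struct k_work_sync sync;

    atomic_set(&obj->stopping, 1);

    /* Resubmit so the handler runs at least once more and performs the
     * cleanup, then wait for it to finish (not from an ISR).
     */
    (void)k_work_submit(&obj->work);
    (void)k_work_flush(&obj->work, &sync);
}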

Finding the Containing Instance in a Work Handler

Be aware that when using an embedded struct k_work_delayable the handler will be invoked with a pointer to the work structure contained inside that delayable structure. The internals of that structure are not public API, and it is not permitted to use CONTAINER_OF relative to the struct k_work *work pointer passed to the handler. The correct way of getting a pointer to a structure that contains a delayable work structure is exemplified by

struct k_work_delayable *dwork = k_work_delayable_from_work(work);
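
A fuller sketch (the containing type and field names are hypothetical):

struct my_data {
    struct k_work_delayable dwork;
    /* ... other state ... */
};

static void my_dwork_handler(struct k_work *work)
{
    struct k_work_delayable *dwork = k_work_delayable_from_work(work);
    struct my_data *data = CONTAINER_OF(dwork, struct my_data, dwork);

    /* ... use data ... */
}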

@pabigot (Collaborator, Author) commented Mar 7, 2021

NB: This comment will be updated as things change

The following PRs provide examples of updating work API to avoid the problematic idioms identified above.

Other related PRs:

@pabigot (Collaborator, Author) commented Mar 8, 2021

NB: This comment will be updated as things change

Changes to original description:

2021-03-08

  • Sanctioned use of k_work_is_pending() to confirm that a work item is not in use, so that state the handler may reference can safely be updated before resubmission. (This idiom was a bug in the previous k_work implementation, where changes to data associated with a "not pending" work item could cause read/write conflicts with in-progress access of that data from the work handler.)

2021-03-17

  • Complete replacement of Best Practices section.

2021-03-31

  • Added a section above with the areas where conversions are necessary and their assignees, linking to rote conversion commits in mass conversion of k_work API #33924 as a starting point.
  • Added a reference to an example of obtaining the container of a structure that has a struct k_work_delayable instance, given the struct k_work * pointer passed to a work handler.

2021-04-16

  • Documented the best practice of commenting, at the point of call, the expected result of using a racy construct such as an unchecked asynchronous cancel.

@jharris-intel (Contributor) commented:

Specifically: when new work comes in, just submit it. Do not attempt to determine whether the submission is necessary. Have the handler determine whether there is more work to be done, and exit quickly if there isn't.

Could you please elaborate on this? I don't see how you can do this in the general case without forcing dynamic allocation of work queue items.

In particular: how can you know if the work queue item is safe to reinitialize?

Can you provide a simple example of what this looks like for a function that wants to have no more than, say, four work queue items outstanding at a time, erroring out any additional requests over said limit?

@pabigot (Collaborator, Author) commented Mar 9, 2021

Specifically: when new work comes in, just submit it. Do not attempt to determine whether the submission is necessary. Have the handler determine whether there is more work to be done, and exit quickly if there isn't.

Could you please elaborate on this? I don't see how you can do this in the general case without forcing dynamic allocation of work queue items.

In particular: how can you know if the work queue item is safe to reinitialize?

The majority of uses of work queue items in Zephyr are dedicated to a single handler function, with operation-specific state provided by embedding the work item in a statically allocated container that holds that state. k_work_init() is invoked before anything is submitted. Once that's done the item can be submitted at any time, with the submission either having effect (if it's not in a queue, it's added to one) or no effect (if it's already in a queue, it stays there).

There may be some cases where a work item is re-initialized after it's already been submitted, but those have tended to be errors, or unnecessary. If it would be necessary the solution would be to protect submitting the item with a lock, then within the lock confirming the item is not busy (not delayed, queued, or running with or without an in-progress cancellation), which is the state in which it can, if necessary, be reconfigured.

Can you provide a simple example of what this looks like for a function that wants to have no more than, say, four work queue
items outstanding at a time, erroring out any additional requests over said limit?

Not conveniently, no, because I don't understand the question or use case. Work items aren't pooled, and a work queue can hold as many work items as are submitted to it: the links are part of each work item.

@jharris-intel (Contributor) commented Mar 9, 2021

Work items aren't pooled, and a work queue can hold as many work items as are submitted to it: the links are part of each work item.

It's not the work queue that I'm concerned about here, it's the work queue items, and in particular data associated with the work queue items.

The current approach appears to solve half of the problem.

Again, coming back to the example: I wish to have no more than 4 outstanding work items for deferred_foo() . I have a static pool of 4 struct foo_work {struct k_work work; struct non_atomic_struct data;}, so far so good.

I receive a request try_run_foo(new_data). Now what?

How can I determine when it is safe to reinitialize a work queue item from my pool (in particular, additional data associated with a work queue item, a la CONTAINER_OF) without running into the exact same "checking whether work is pending before deciding whether to submit/schedule" antipattern?

If I wasn't concerned about race conditions / cancellation / etc, this would be simple (but then again, so would the k_work API!):

struct non_atomic_struct {
    uint8_t a;
    uint64_t b;
};
struct foo_work {
    struct k_work work;
    struct non_atomic_struct data;
};

void deferred_foo(struct k_work *work) {
    struct foo_work *data = CONTAINER_OF(work, struct foo_work, work);
    do_something_with(data->data);
}

static struct foo_work foo_work[4];

void init(void) {
    for (size_t i = 0; i < 4; i++) {
        k_work_init(&foo_work[i].work, &deferred_foo);
    }
}

bool try_run_foo(struct non_atomic_struct new_data) {
    for (size_t i = 0; i < 4; i++) {
        if (!k_work_is_pending(&foo_work[i].work)) {
            foo_work[i].data = new_data;
            k_work_submit(&foo_work[i].work);
            return true;
        }
    }
    return false;
}

I can't do the "submit and see if it works" approach because of data.

I can e.g. wrap this entire thing in a mutex, but at that point the work queue API becomes largely useless here. I'm wrapping the work queue API in another work queue API that still has to spend a bunch of effort making sure that cancellation and such work - in which case it's easier and faster to directly reimplement the work queue API and avoid using the k_work API at all... which is rather unfortunate.


Note that e.g. the usb subsystem does something very similar to the above, with CONFIG_USB_MAX_NUM_TRANSFERS max work queue items outstanding.

@jukkar (Member) commented Mar 9, 2021

Good notes. I did not check the documentation, but some kind of BKM (Best Known Methods) section about the usage would be nice to have there.

@pabigot (Collaborator, Author) commented Mar 9, 2021

Specifically: when new work comes in, just submit it. Do not attempt to determine whether the submission is necessary. Have the handler determine whether there is more work to be done, and exit quickly if there isn't.

Could you please elaborate on this? I don't see how you can do this in the general case without forcing dynamic allocation of work queue items.

OK, so let me go back to this original comment and leave the general case of a pool of work items aside.

Lots of Zephyr use of delayable work items are doing things like:

k_delayed_work_cancel();  // not checked for success
do stuff;
if (stuff_condition) {
   k_delayed_work_submit();
}

But the cancel does not guarantee that whatever data you operate on in stuff() is not being acted on by a work item, because the work item can still be running. So it's wrong. You can't just avoid a lock by asynchronously canceling the work item and assuming there's no longer a risk of multiple access.

What I'd rather see instead is:

take lock on stuff;
inspect stuff;
if (stuff_condition) {
   update stuff;
   submit work();
} else {
   update stuff to mark nop;
   submit work immediately(); // when it's done, everything's clean.
}
release lock on stuff;  // so work item can clean up
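
Rendered slightly more concretely (a sketch; the names and the choice of a spin lock are illustrative, not from in-tree code):

void on_event(struct stuff *s)
{
    k_spinlock_key_t key = k_spin_lock(&s->lock);

    if (stuff_condition(s)) {           /* hypothetical predicate */
        update_stuff(s);                /* hypothetical */
        (void)k_work_schedule(&s->dwork, K_MSEC(100));
    } else {
        s->nop = true;                  /* handler will only clean up */
        (void)k_work_reschedule(&s->dwork, K_NO_WAIT);
    }

    k_spin_unlock(&s->lock, key);       /* work item can now take the lock */
}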

I can't assess for anybody whether this is easier than doing the above locking and also implementing your own work thread; I think it is, but that's up to you. All I want is for people to stop using k_work_cancel() as a means of avoiding locks.

There are going to be some cases where you still don't need a lock because of the way items are submitted in a specific system, but they may not be as common as I'd hoped. In fact that's why #33109 (comment) is giving me a headache, and the rework to make it OK is going to be significantly more complex.

When I have a handle on that I'll update the best known practices.

@jharris-intel (Contributor) commented:

but they may not be as common as I'd hoped

Yeah...

You can wrap the entire thing in a lock, sometimes. And now you're wrapping a locking API in your own locking API, which has unfortunate implications for performance, especially on SMP, and especially as k_work is often used in things that are relatively performance-sensitive (e.g. offload from ISRs).

I am also somewhat confused as to why cancel can't be synchronous/blocking, but the suggested replacement is a synchronous/blocking lock call.

(Spinlock? Or normal lock? If spinlock, ow. If normal lock, what happens when you're in an ISR, which is one of the suggested work queue usecases?)


One possibility for some of these cases might be to have a separate optional reserve step. This would allow many of these cases to work. (Then the implementation, if carefully constructed, can avoid having to lock for the actual submit after the reserve.)

That is something like:

bool try_run_foo(struct non_atomic_struct new_data) {
    for (size_t i = 0; i < 4; i++) {
        if (k_work_try_reserve(&foo_work[i].work)) {
            foo_work[i].data = new_data;
            if (k_work_submit_reserved(&foo_work[i].work)) { /* fails if work item was cancelled while reserved */
                return true;
            }
        }
    }
    return false;
}

@pabigot (Collaborator, Author) commented Mar 9, 2021

I am also somewhat confused as to why cancel can't be synchronous/blocking, but the suggested replacement is a synchronous/blocking lock call.

"lock" wasn't intended to specify a technology.

Please review the API. Cancellation as well as submission/scheduling must be supported from ISRs, so a non-blocking option is required, but there is blocking API as well.

(Spinlock? Or normal lock? If spinlock, ow. If normal lock, what happens when you're in an ISR, which is one of the suggested work queue usecases?)

Probably a spinlock, or state recorded in an atomic_t; it's often not clear in which context functions might be invoked. In cases where sleeping is allowed, the subsystem maintainer needs to get involved to determine the impact of doing so.

@jharris-intel (Contributor) commented:

"lock" wasn't intended to specify a technology.

I didn't say it did - in fact I said otherwise:

(Spinlock? Or normal lock? If spinlock, ow. If normal lock, what happens when you're in an ISR, which is one of the suggested work queue usecases?)

Fundamentally, if you're trying to lock something in a nonblocking context, you have to either:

  1. Occasionally error out.
  2. Occasionally spin (or deadlock)
  3. "doctor it hurts; well don't do that" -> i.e. make sure you never get into the nonblocking context with the lock held by something else.

This is agnostic of the lock implementation you are using.

If, as you note, previously you were doing

k_delayed_work_cancel();
do stuff;
if (stuff_condition) {
   k_delayed_work_submit();
}

in a nonblocking context (which was one of the suggested usecases of k_work... namely ISR offload), and you replace it with a lock/unlock-based replacement based on your suggestion (for reference:

take lock on stuff;
inspect stuff;
if (stuff_condition) {
   update stuff;
   submit work();
} else {
   update stuff to mark nop;
   submit work immediately(); // when it's done, everything's clean.
}
release lock on stuff;  // so work item can clean up

)

you'll have some combination of:

  1. Failing occasionally, because you can't take the lock.
  2. Spinning in ISR context (ow).
  3. Spending a fair amount of time with interrupts disabled. (Even worse on SMP.)

I am aware that the old version was broken in practice.
