
kernel: SMP scheduling: edge case with pinned threads #69005

Closed
peter-mitsis opened this issue Feb 15, 2024 · 9 comments · Fixed by #73884
Labels
area: Kernel · area: SMP (Symmetric multiprocessing) · bug (The issue is a bug, or the PR is fixing a bug) · priority: medium (Medium impact/importance bug)

Comments

@peter-mitsis
Collaborator

Describe the bug
SMP systems (with IPI support) that make use of pinned threads may fail to schedule the highest priority thread that is ready to run.

The last CPU in a SMP system to service an IPI and switch its thread out in favour of a higher priority thread pinned to that CPU can result in that switched-out thread not getting scheduled anywhere until another schedule point is reached.

To Reproduce

Imagine a system with 2 CPUs. In this system there are the following four threads to consider.
T1 - executing on CPU0, not pinned, has priority 10
T2 - executing on CPU1, not pinned, has priority 15
T3 - pended, pinned to CPU0, has priority 5
T4 - pended, not pinned, has priority 12

Some event happens and CPU1 unpends both T3 and T4 and makes them executable at the same time. Since threads were made ready, flag_ipi() is called and the pending_ipi flag is set. CPU1 hits its reschedule point to switch out T2 for T4 and issues a call to signal_pending_ipi(). This will clear the pending_ipi flag and trigger the IPI. CPU0 will get the IPI and hit a reschedule point. It is at this point that CPU0 will switch out T1 for T3. At this stage of events, we now have ...

Final Summary
T1 - priority 10, ready but not executing anywhere
T2 - priority 15, ready but not executing anywhere
T3 - priority 5, ready and executing on CPU0
T4 - priority 12, ready and executing on CPU1
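
For concreteness, the setup above could be expressed roughly as follows in a Zephyr application. This is only an untested sketch, not the actual reproducer; the entry points, stack sizes, and the wakeup semaphore are made up for illustration, and pinning requires CONFIG_SCHED_CPU_MASK ...

#include <zephyr/kernel.h>

#define STACKSZ 1024

K_THREAD_STACK_DEFINE(t1_stack, STACKSZ);
K_THREAD_STACK_DEFINE(t2_stack, STACKSZ);
K_THREAD_STACK_DEFINE(t3_stack, STACKSZ);
K_THREAD_STACK_DEFINE(t4_stack, STACKSZ);

static struct k_thread t1, t2, t3, t4;

K_SEM_DEFINE(wake_sem, 0, 2);

static void spin(void *a, void *b, void *c)
{
    while (true) {
        k_busy_wait(1000);
    }
}

static void pend_then_spin(void *a, void *b, void *c)
{
    /* T3 and T4 start out pended on a kernel object */
    k_sem_take(&wake_sem, K_FOREVER);
    spin(a, b, c);
}

int main(void)
{
    /* T1: priority 10, not pinned */
    k_thread_create(&t1, t1_stack, K_THREAD_STACK_SIZEOF(t1_stack),
                    spin, NULL, NULL, NULL, 10, 0, K_NO_WAIT);

    /* T2: priority 15, not pinned */
    k_thread_create(&t2, t2_stack, K_THREAD_STACK_SIZEOF(t2_stack),
                    spin, NULL, NULL, NULL, 15, 0, K_NO_WAIT);

    /* T3: priority 5, pinned to CPU0 (needs CONFIG_SCHED_CPU_MASK) */
    k_tid_t tid3 = k_thread_create(&t3, t3_stack, K_THREAD_STACK_SIZEOF(t3_stack),
                                   pend_then_spin, NULL, NULL, NULL,
                                   5, 0, K_FOREVER);
    k_thread_cpu_pin(tid3, 0);
    k_thread_start(tid3);

    /* T4: priority 12, not pinned */
    k_thread_create(&t4, t4_stack, K_THREAD_STACK_SIZEOF(t4_stack),
                    pend_then_spin, NULL, NULL, NULL, 12, 0, K_NO_WAIT);

    /* "Some event": make T3 and T4 ready at the same time. The failure
     * additionally requires that this happens from CPU1's context while
     * T1 and T2 occupy the CPUs as described above.
     */
    k_sem_give(&wake_sem);
    k_sem_give(&wake_sem);
    return 0;
}

In practice, forcing the exact interleaving (which CPU runs which thread and when the IPI lands) is the hard part, which is what the actual test case attached further down in this thread takes care of.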

Expected behavior

Based on the Final Summary shown above, since T1 is of a higher priority than T4 (10 vs 12), it is expected that T1 would be executing on CPU1 and not T4.


Impact

It is recognised that T1 will eventually get scheduled when CPU1 reaches its next scheduling point, but we cannot know when that will be. It might be a few hundred microseconds later, or seconds later, or even never. Until it happens, we essentially have a type of priority inversion on T1. In a system with a tickless kernel, this could be a very long time.

Fortunately, this inversion can only apply to systems that pin threads to execute on a subset of the CPUs in the system.

Additional context
I think that in general, if a CPU that is switching out a thread in favour of a pinned thread is not the first CPU to do so, then we can encounter this issue.

Possible Solution
If a thread in an SMP system is being switched out in favour of a pinned or otherwise CPU-restricted thread, then we could signal another IPI to the other non-restricted CPUs. Although this may lead to more interrupts and ISRs (which would have a performance impact), it would keep this type of priority inversion to a minimum.
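
As an illustration only, the check could look roughly like the pseudocode below; thread_is_cpu_restricted() and thread_still_ready() are hypothetical helpers, not existing kernel symbols ...

/* Pseudocode sketch of the proposed mitigation -- not actual Zephyr code.
 * thread_is_cpu_restricted() and thread_still_ready() are hypothetical
 * helpers standing in for checks against the thread's CPU mask and state.
 */
static void reschedule_hook(struct k_thread *incoming, struct k_thread *outgoing)
{
    if (thread_is_cpu_restricted(incoming) && thread_still_ready(outgoing)) {
        /* The outgoing thread may now be the best candidate for another
         * CPU, but that CPU's IPI may already have been consumed, so
         * re-arm and re-send it.
         */
        flag_ipi();
        signal_pending_ipi();
    }
}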

@peter-mitsis peter-mitsis added the bug The issue is a bug, or the PR is fixing a bug label Feb 15, 2024
@henrikbrixandersen henrikbrixandersen added area: Kernel area: SMP Symmetric multiprocessing labels Feb 15, 2024
@henrikbrixandersen
Member

@peter-mitsis @andyross Please evaluate and assign a priority to this issue.

@andyross
Contributor

It is at this point that CPU0 will switch out T1 for T3.

Which modifies the run queue, which will trigger another IPI, right? I'm pretty sure we get this right already? Is there an existence proof of the bug?

@peter-mitsis
Collaborator Author

Which modifies the run queue, which will trigger another IPI, right?

I may be mistaken, but I suspect that you are referring to the call to z_requeue_current() from within do_swap(). That routine does indeed call signal_pending_ipi(); however, signal_pending_ipi() will only call arch_sched_ipi() if pending_ipi is true, and pending_ipi is only set to true in flag_ipi().

The routine flag_ipi() is called from a few places, but I think that the only relevant one in this scenario is ready_thread() (which is called by z_ready_thread()). As you know, z_ready_thread() can get invoked when unblocking a thread from one of our kernel objects. In the scenario I described, when CPU0 switches out T1 for T3, the pending_ipi flag is already false because the earlier call to signal_pending_ipi() on CPU1 cleared it, and nothing has happened since to set it again.
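
For anyone following along, the interaction looks roughly like this (a simplified paraphrase of the scheduler sources of that time, with config guards and CPU-count checks omitted) ...

/* Simplified paraphrase of the relevant kernel logic, not verbatim code. */
static void flag_ipi(void)
{
    /* Called when a thread is made ready (e.g. from ready_thread()) */
    _kernel.pending_ipi = true;
}

void signal_pending_ipi(void)
{
    /* Called at reschedule points (e.g. via z_requeue_current()/do_swap()).
     * Only the first CPU to get here after flag_ipi() actually sends the
     * IPI; by the time CPU0 switches T1 out for T3, the flag is already
     * false again, so no further IPI is sent on T1's behalf.
     */
    if (_kernel.pending_ipi) {
        _kernel.pending_ipi = false;
        arch_sched_ipi();
    }
}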

I'm pretty sure we get this right already?

As long as the threads involved do not have any restrictions as to which CPU they may run on, everything is correct and it works. The problem only rears its head in SMP when a system has a thread that is not allowed to run on one or more CPUs. This is because a CPU no longer picks the highest priority ready thread, but rather the highest priority ready thread that it is allowed to run.

Is there an existence proof of the bug?

I did create a test project to force this scenario, and I am confident that I succeeded. However, due to its need to strictly control the scheduling points, it came out somewhat hard to follow, and I need to clean it up (and hopefully simplify it) before I post it here.

@peter-mitsis
Collaborator Author

I am uploading a zip file that contains a test case to reproduce the failure.

smp_ipi_test.zip

Build it using ...

cd <test directory>
west build -p -b qemu_x86_64 ./ -t run

When it fails, you should observe something like ...

*** Booting Zephyr OS build v3.6.0-rc2-94-g865128fff59f ***
T3 switched in!.

To make it pass, edit the source file and remove/comment the following line ...

#define FORCE_FAILURE

Rebuild, and the output when it passes will look something like ...

*** Booting Zephyr OS build v3.6.0-rc2-94-g865128fff59f ***
This message should not appear!
T3 switched in!.
134.121.X4Y31
T1 - This message should show!

@peter-mitsis peter-mitsis added the priority: medium Medium impact/importance bug label Feb 20, 2024
@andyross
Contributor

Oh... I think I see what you're saying. FWIW I don't think this really has anything to do with cpu_mask except as an exerciser strategy. It's a race/misfeature with the way we do preemption:

  1. When a CPU preempts one runnable thread with another, it does not signal an IPI synchronously, by design.[1]
  2. If the preempted thread happens to be higher priority than the thread running on another CPU, then for a window of time it will not run anywhere, even though it is higher priority.

It's true that #2 is unlikely to happen under typical scheduling, because the preempting thread will likely have preempted whatever CPU it was made runnable on, which (by construction) is running a lower priority thread than ours. But you can also make it runnable from an interrupt, and see this effect too.

I'm... frankly not sure this is a bug we want to fix. The situation corrects itself the very next time an IPI is signaled for any reason, or when the CPU running the "wrong" thread gets an interrupt for any reason, or if its current thread hits a scheduling point for any reason. I think that's largely acceptable: single CPU apps simply can't preempt except in those circumstances anyway (stated differently: we're demanding SMP add behavior to fix a bug that all single-core apps need to deal with anyway).

The fix, if we really want it, would be to synchronously IPI every time we preempt on an SMP system, and that's extremely expensive.

[1] The assumption being that whatever IPI needed to happen already did when the run queue was modified making us preemptible. But that IPI might have already been handled by the time we get to the context switch.

@andyross
Contributor

To clarify: you should be able to make this happen more simply (on a 2-CPU system in this context; a rough sketch follows the list):

  1. Start "low priority" thread
  2. Start "medium priority" thread to run on the second CPU
  3. Enter an ISR that preempts the medium priority thread (irq_offload, whatever)
  4. In the ISR, start a high priority thread
  5. Verify that the low priority thread is still running even though medium "should" have migrated and preempted it.
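
A possible shape for that test, as an untested sketch (the priorities, stack sizes, and use of irq_offload() here are illustrative assumptions; it needs a 2-CPU SMP build with CONFIG_IRQ_OFFLOAD) ...

#include <zephyr/kernel.h>
#include <zephyr/irq_offload.h>

#define STACKSZ 1024

K_THREAD_STACK_DEFINE(low_stack, STACKSZ);
K_THREAD_STACK_DEFINE(med_stack, STACKSZ);
K_THREAD_STACK_DEFINE(high_stack, STACKSZ);

static struct k_thread low_thread, med_thread, high_thread;

static void spin(void *a, void *b, void *c)
{
    while (true) {
        k_busy_wait(1000);
    }
}

static void start_high_from_isr(const void *arg)
{
    /* Step 4: make a high priority thread runnable from ISR context */
    k_thread_create(&high_thread, high_stack, K_THREAD_STACK_SIZEOF(high_stack),
                    spin, NULL, NULL, NULL, 1, 0, K_NO_WAIT);
}

static void med_fn(void *a, void *b, void *c)
{
    /* Give the low priority thread time to start spinning on the other CPU */
    k_sleep(K_MSEC(50));

    /* Step 3: irq_offload() runs its handler in interrupt context on the
     * calling CPU, so the ISR preempts this (medium priority) thread.
     */
    irq_offload(start_high_from_isr, NULL);
    spin(a, b, c);
}

int main(void)
{
    /* Steps 1 and 2: a low priority spinner and a medium priority thread */
    k_thread_create(&low_thread, low_stack, K_THREAD_STACK_SIZEOF(low_stack),
                    spin, NULL, NULL, NULL, 10, 0, K_NO_WAIT);
    k_thread_create(&med_thread, med_stack, K_THREAD_STACK_SIZEOF(med_stack),
                    med_fn, NULL, NULL, NULL, 5, 0, K_NO_WAIT);

    /* Step 5: after the dust settles, check (e.g. with logging in the
     * spinners) whether the low priority thread is still on a CPU even
     * though the medium priority thread is ready.
     */
    k_sleep(K_SECONDS(1));
    return 0;
}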

Again though, the fix will hurt more than the problem. If we do this, let's for sure gate it behind a default-n Kconfig.

@peter-mitsis
Collaborator Author

Thanks for the insights. If this goes forward, I definitely see the value of having it as a Kconfig option whose default is 'n'. On that note, one idea I have been exploring to help with the cost of IPI cascades, and of IPIs in general, is to target the IPIs only at the relevant CPUs. If I am following things correctly, any time a thread is made ready to run, we broadcast an IPI to all CPUs. As a result, each CPU has to go through the scheduler, even if only to determine that it does not need to schedule in a new thread.

If we can target the IPIs to CPUs that are executing a lower priority thread, then that ought to give us some performance benefit.
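
Rough pseudocode for that idea (illustrative only; the helpers below are hypothetical, and this is not the optimization that was eventually merged) ...

/* Hypothetical sketch: build a mask of CPUs worth interrupting instead of
 * broadcasting. cpu_current_thread(), thread_priority() and
 * thread_can_run_on() are made-up helpers for illustration.
 */
static uint32_t ipi_mask_for(struct k_thread *newly_ready)
{
    uint32_t mask = 0;

    for (unsigned int cpu = 0; cpu < arch_num_cpus(); cpu++) {
        struct k_thread *curr = cpu_current_thread(cpu);

        /* Lower numeric priority is more important in Zephyr */
        if (thread_can_run_on(newly_ready, cpu) &&
            thread_priority(newly_ready) < thread_priority(curr)) {
            mask |= BIT(cpu);
        }
    }

    return mask;   /* IPI only the CPUs in this mask */
}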


This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed; otherwise this issue will automatically be closed in 14 days. Note that you can always re-open a closed issue at any time.

@peter-mitsis
Collaborator Author

Now that the IPI delivery optimization has been merged, I am going to finish tackling this issue.
