Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

twister with --device-testing sometimes overlaps tests #32679

Closed
andyross opened this issue Feb 25, 2021 · 0 comments · Fixed by #32784
Closed

twister with --device-testing sometimes overlaps tests #32679

andyross opened this issue Feb 25, 2021 · 0 comments · Fixed by #32784
Assignees
Labels
area: Twister Twister bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@andyross
Copy link
Contributor

While working on #32678 I discovered that twister isn't perfectly serialized when doing device testing. Sometimes I'll see a second test fire up unbidden, which then clobbers the running one and (for the specific case of that script) causes a hang where the scripts (which need to wait for each other) never complete. Twister then times out the next 1-3 tests before getting back in sync.

You can watch this happen with "ps" in a loop, I've seen as many as three log reading scripts running at the same time.

This seems much more common after a failure, and especially after a (legitimate, real device hang) timeout.

The rule obviously has to be that no script shall be started before the scripts from the previous run are killed and known to be dead.

@andyross andyross added bug The issue is a bug, or the PR is fixing a bug area: Twister Twister labels Feb 25, 2021
andyross pushed a commit to andyross/zephyr that referenced this issue Mar 2, 2021
CPython is sometimes described as "single threaded" due to the GIL,
but the interpreter will still "preemptively" switch between threads
(the details seem poorly documented).

So the time between checking whether acquire is 1 and decrementing the
count could result in more than one thread seeing an "available"
device, and more than one test being run (simultaneously, on the same
physical device!).  We have a big herd of threads all polling for
this, so in a large test run this would happen maybe one time out of
20-30 tries.

Use a lock.  Also remove the very similar looking
DUT.get_available_device() method, which had the same bug but appears
to be dead code.

Fixes zephyrproject-rtos#32679

Signed-off-by: Andy Ross <[email protected]>
@nashif nashif added the priority: medium Medium impact/importance bug label Mar 2, 2021
andyross pushed a commit to andyross/zephyr that referenced this issue Mar 3, 2021
CPython is sometimes described as "single threaded" due to the GIL,
but the interpreter will still "preemptively" switch between threads
(the details seem poorly documented).

So the time between checking whether acquire is 1 and decrementing the
count could result in more than one thread seeing an "available"
device, and more than one test being run (simultaneously, on the same
physical device!).  We have a big herd of threads all polling for
this, so in a large test run this would happen maybe one time out of
20-30 tries.

Use a lock.  Also remove the very similar looking
DUT.get_available_device() method, which had the same bug but appears
to be dead code.

Fixes zephyrproject-rtos#32679

Signed-off-by: Andy Ross <[email protected]>
nashif pushed a commit that referenced this issue Mar 4, 2021
CPython is sometimes described as "single threaded" due to the GIL,
but the interpreter will still "preemptively" switch between threads
(the details seem poorly documented).

So the time between checking whether acquire is 1 and decrementing the
count could result in more than one thread seeing an "available"
device, and more than one test being run (simultaneously, on the same
physical device!).  We have a big herd of threads all polling for
this, so in a large test run this would happen maybe one time out of
20-30 tries.

Use a lock.  Also remove the very similar looking
DUT.get_available_device() method, which had the same bug but appears
to be dead code.

Fixes #32679

Signed-off-by: Andy Ross <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Twister Twister bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants