Test injection_queue_depth_multi_thread is flaky #6847

Open · Darksonn opened this issue Sep 14, 2024 · 12 comments
Labels
A-tokio (Area: The main tokio crate) · E-help-wanted (Call for participation: Help is requested to fix this issue.) · E-medium (Call for participation: Experience needed to fix: Medium / intermediate) · M-metrics (Module: tokio/runtime/metrics)

Comments

@Darksonn (Contributor)

This test has been observed to fail in CI:

#[test]
fn injection_queue_depth_multi_thread() {
    let rt = threaded();
    let metrics = rt.metrics();
    let barrier1 = Arc::new(Barrier::new(3));
    let barrier2 = Arc::new(Barrier::new(3));

    // Spawn a task per runtime worker to block it.
    for _ in 0..2 {
        let barrier1 = barrier1.clone();
        let barrier2 = barrier2.clone();
        rt.spawn(async move {
            barrier1.wait();
            barrier2.wait();
        });
    }

    barrier1.wait();
    for i in 0..10 {
        assert_eq!(i, metrics.injection_queue_depth());
        rt.spawn(async {});
    }
    barrier2.wait();
}

To close this issue, figure out why it is failing and fix it.

@Darksonn added the E-help-wanted, A-tokio, E-medium, and M-metrics labels on Sep 14, 2024
@Darksonn changed the title from injection_queue_depth_multi_thread to "Test injection_queue_depth_multi_thread is flaky" on Sep 14, 2024
@jofas (Contributor) commented Sep 15, 2024

I presume it's these two recent jobs that timed out that have been affected by the flakiness of injection_queue_depth_multi_thread? I'd assume so, because if this assert

assert_eq!(i, metrics.injection_queue_depth());

fails, the main thread panics and we never get to synchronise on barrier2 here:

barrier2.wait();

That causes the test to run forever (or until CI times out).
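
To make that hypothesized mechanism concrete, here is a minimal standalone sketch (plain std threads stand in for the tokio workers and for the runtime's shutdown; this is a hypothetical illustration, and it hangs by design when run):

use std::sync::{Arc, Barrier};
use std::thread;

fn main() {
    // Stand-ins for the two runtime workers blocked inside their tasks.
    let barrier2 = Arc::new(Barrier::new(3));
    let workers: Vec<_> = (0..2)
        .map(|_| {
            let barrier2 = barrier2.clone();
            thread::spawn(move || {
                barrier2.wait();
            })
        })
        .collect();

    // Stand-in for the failing assert_eq!: the test thread panics
    // before it ever reaches barrier2.wait()...
    let result = std::panic::catch_unwind(|| {
        panic!("assertion failed");
        // barrier2.wait() would have released the workers here.
    });
    assert!(result.is_err());

    // ...so waiting on the workers (as runtime shutdown does with its
    // worker threads) blocks forever. This is the CI timeout.
    for worker in workers {
        worker.join().unwrap();
    }
}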

@Darksonn (Contributor, Author)

Yes. Good point about the assert. That explains why it times out instead of failing normally.

@jofas (Contributor) commented Sep 16, 2024

I wonder if a first step[1] would be to convert the assert into an if statement, so that we'd panic only after calling barrier2.wait(). Then we'd get better diagnostics, but more importantly we wouldn't time out CI any more. Something along the lines of the sketch after the footnote below.

Footnotes

  [1] In case this isn't a quick fix
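
A hypothetical sketch of that restructuring (reusing the threaded() helper and the surrounding setup from the original test; this is not an actual patch):

#[test]
fn injection_queue_depth_multi_thread() {
    let rt = threaded();
    let metrics = rt.metrics();
    let barrier1 = Arc::new(Barrier::new(3));
    let barrier2 = Arc::new(Barrier::new(3));

    // Spawn a task per runtime worker to block it.
    for _ in 0..2 {
        let barrier1 = barrier1.clone();
        let barrier2 = barrier2.clone();
        rt.spawn(async move {
            barrier1.wait();
            barrier2.wait();
        });
    }

    barrier1.wait();

    // Record the first mismatch instead of asserting immediately.
    let mut failure = None;
    for i in 0..10 {
        let depth = metrics.injection_queue_depth();
        if failure.is_none() && depth != i {
            failure = Some((i, depth));
        }
        rt.spawn(async {});
    }

    // Release the blocked workers first, then report the failure, so
    // runtime shutdown cannot hang on the still-waiting tasks.
    barrier2.wait();
    if let Some((expected, actual)) = failure {
        panic!("injection_queue_depth: expected {expected}, got {actual}");
    }
}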

@Darksonn (Contributor, Author) commented Sep 23, 2024

Making it fail instead of hanging would be nice. I submitted a PR for that. (But the flakiness is not fixed by this.)

@jofas (Contributor) commented Sep 24, 2024

Interestingly enough, after I pulled your changes, the test still stalls instead of failing. It seems I was wrong about the assert_eq! causing a panic that keeps us from ever reaching the second barrier.

At least I built this highly professional bash script (sharing it here in case someone else wants to use it) that finally allowed me to trigger the flakiness on my local x86_64 Linux system:

#!/usr/bin/env bash

# Re-run the flaky test until it fails (or hangs), counting the passes.
run_test() {
  cargo test \
    --all-features \
    --test rt_metrics \
    injection_queue_depth_multi_thread \
    -- --nocapture
}

i=0
while run_test; do
  let i++
  echo -e "$i: \e[32m☑\e[0m"
done

Now that I can litter my fork with debug print statements, I'll investigate some more 🙂

@jofas (Contributor) commented Sep 24, 2024

My preliminary findings, after adding debug prints before and after the line[1], are that we deadlock on barrier1.wait() here:

barrier1.wait();

(A sketch of that instrumentation follows the footnote.)

Footnotes

  [1] Best way to debug control flow
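
The instrumentation presumably looked something like this (a hypothetical reconstruction of the fragment inside the test, not the actual debug code):

// Hypothetical debug prints around the suspect line in the test:
eprintln!("[test] before barrier1.wait()");
barrier1.wait();
eprintln!("[test] after barrier1.wait()"); // never printed in deadlocked runs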

@jofas (Contributor) commented Sep 24, 2024

Uhm, isn't the flakiness just a mundane case of accidentally blocking the runtime before it can schedule the second task?

@Darksonn (Contributor, Author)

The runtime isn't supposed to get blocked by this. If there is an idle worker thread and a runnable task, the worker thread must pick up a runnable task. The only exception is the LIFO slot which is not relevant here.

@jofas (Contributor) commented Sep 24, 2024

Could there be some weirdness around parked threads? On my machine, the deadlocked test parks the second worker before the second task arrives in its work queue, and never unparks it again.

@carllerche (Member)

Looking at the source, my guess is one worker gets both tasks off the injection queue in one batch and doesn’t notify a peer to steal.

@jofas would you be able to try to isolate this case as a loom test?
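
For anyone following along: a real loom test for this would have to live inside tokio's own source tree and be built with RUSTFLAGS="--cfg loom", since loom swaps out the std synchronization primitives. The general shape of a loom test, shown here as a toy example unrelated to the scheduler bug itself, looks like this:

// Toy loom example (not the tokio scheduler test): loom::model re-runs
// the closure under every interleaving it can reach, so a lost-
// notification bug would surface as a failing or hanging branch.
#[cfg(loom)]
#[test]
fn loom_explores_all_interleavings() {
    loom::model(|| {
        use loom::sync::atomic::{AtomicUsize, Ordering};
        use loom::sync::Arc;
        use loom::thread;

        let counter = Arc::new(AtomicUsize::new(0));

        let handles: Vec<_> = (0..2)
            .map(|_| {
                let counter = counter.clone();
                thread::spawn(move || {
                    counter.fetch_add(1, Ordering::SeqCst);
                })
            })
            .collect();

        for handle in handles {
            handle.join().unwrap();
        }

        // Must hold in every explored interleaving.
        assert_eq!(counter.load(Ordering::SeqCst), 2);
    });
}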

@jofas (Contributor) commented Sep 24, 2024

I'd love to give it a try, for sure. This is quite exciting. @Darksonn, may I ask you questions in case I get stuck on this?

@Darksonn (Contributor, Author)

Yes, feel free to send questions my way. A loom test would be a good start.
