
Static Executors #112

Merged (32 commits), May 12, 2024
Conversation

@james7132 (Contributor) commented Apr 11, 2024:

Resolves #111. Creates a StaticExecutor type under a feature flag and allows constructing it from an Executor via Executor::leak. Unlike the executor it came from, it's a wrapper around a State and omits all changes to active.

Note: unlike the API proposed in #111, this PR also includes an unsafe StaticExecutor::spawn_scoped for spawning non-'static tasks, where the caller is responsible for ensuring that the task doesn't outlive the borrowed state. This would be required for Bevy to migrate to this type, since we're currently using lifetime transmutation on Executor to enable Thread::scope-like APIs for working with borrowed state. StaticExecutor does not have an external lifetime parameter, so that approach is infeasible without such an API.
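As context for the leak-based design, here is a hedged, std-only sketch of the pattern Executor::leak relies on (the leak_registry name and Mutex state are hypothetical stand-ins, not this crate's API): Box::leak converts an owned value into a &'static reference that is never dropped, which is what lets the static form skip the teardown bookkeeping the owned form needs.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for an executor's internal state.
// `Box::leak` trades the ability to ever drop the value for a
// `&'static` reference usable from anywhere, for the whole program.
fn leak_registry() -> &'static Mutex<Vec<&'static str>> {
    // The allocation is intentionally never freed; a leak checker
    // (e.g. Miri's) will flag this unless leak detection is disabled.
    Box::leak(Box::new(Mutex::new(Vec::new())))
}

fn main() {
    let registry = leak_registry();
    registry.lock().unwrap().push("task-1");
    assert_eq!(registry.lock().unwrap().len(), 1);
}
```

Because the reference is 'static, no per-instance cleanup (such as tracking active tasks for cancellation on drop) is ever needed, which is where the performance win comes from.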

The performance gains while using the type are substantial:

single_thread/executor::spawn_one
                        time:   [1.6157 µs 1.6238 µs 1.6362 µs]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
single_thread/executor::spawn_batch
                        time:   [28.169 µs 29.650 µs 32.196 µs]
Found 19 outliers among 100 measurements (19.00%)
  10 (10.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
single_thread/executor::spawn_many_local
                        time:   [6.1952 ms 6.2230 ms 6.2578 ms]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
single_thread/executor::spawn_recursively
                        time:   [50.202 ms 50.479 ms 50.774 ms]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
single_thread/executor::yield_now
                        time:   [5.8795 ms 5.8883 ms 5.8977 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

multi_thread/executor::spawn_one
                        time:   [1.2565 µs 1.2979 µs 1.3470 µs]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
multi_thread/executor::spawn_batch
                        time:   [38.009 µs 43.693 µs 52.882 µs]
Found 22 outliers among 100 measurements (22.00%)
  21 (21.00%) high mild
  1 (1.00%) high severe
Benchmarking multi_thread/executor::spawn_many_local: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 386.6s, or reduce sample count to 10.
multi_thread/executor::spawn_many_local
                        time:   [27.492 ms 27.652 ms 27.814 ms]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
Benchmarking multi_thread/executor::spawn_recursively: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.6s, or reduce sample count to 30.
multi_thread/executor::spawn_recursively
                        time:   [165.82 ms 166.04 ms 166.26 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
multi_thread/executor::yield_now
                        time:   [22.469 ms 22.649 ms 22.798 ms]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low severe
  3 (3.00%) low mild

single_thread/leaked_executor::spawn_one
                        time:   [1.4717 µs 1.4778 µs 1.4832 µs]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
single_thread/leaked_executor::spawn_many_local
                        time:   [4.2622 ms 4.3065 ms 4.3489 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild
single_thread/leaked_executor::spawn_recursively
                        time:   [26.566 ms 26.899 ms 27.228 ms]
single_thread/leaked_executor::yield_now
                        time:   [5.7200 ms 5.7270 ms 5.7342 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

multi_thread/leaked_executor::spawn_one
                        time:   [1.3755 µs 1.4321 µs 1.4892 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
multi_thread/leaked_executor::spawn_many_local
                        time:   [4.1838 ms 4.2394 ms 4.2989 ms]
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
multi_thread/leaked_executor::spawn_recursively
                        time:   [43.074 ms 43.159 ms 43.241 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
multi_thread/leaked_executor::yield_now
                        time:   [23.210 ms 23.257 ms 23.302 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

@james7132 (Contributor, author):

Hmmm, Miri clearly does not like this, even though the leak is intentional. I'm not sure how to tackle this without disabling leak detection wholesale.

@james7132 (Contributor, author):

Another alternative is to follow through with the repr(transparent) use detailed in #111, and instead allow LeakedExecutor to be const-constructible (would require ConcurrentQueue::unbounded to be const). This way it could be used in a static.
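To illustrate why const construction matters here: a static item requires a const initializer, which is why ConcurrentQueue::unbounded would have to become const. A std-only sketch with a Mutex<Vec<u32>> standing in for the executor's queue (hypothetical names, not this crate's types):

```rust
use std::sync::Mutex;

// `Mutex::new` and `Vec::new` are both `const fn`, so this queue-like
// state can be placed directly in a `static` with no lazy initialization.
// A const `ConcurrentQueue::unbounded` would enable the same for an
// executor's task queue.
static TASK_QUEUE: Mutex<Vec<u32>> = Mutex::new(Vec::new());

fn main() {
    TASK_QUEUE.lock().unwrap().push(42);
    assert_eq!(TASK_QUEUE.lock().unwrap().pop(), Some(42));
}
```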

@notgull (Member) left a comment:

Would this be useful for LocalExecutor as well?

Paging @smol-rs/admins as I would appreciate another review as well

src/lib.rs, review comment on lines 773 to 776:
pub unsafe fn spawn_scoped<'a, T: Send + 'static>(
&self,
future: impl Future<Output = T> + Send + 'a,
) -> Task<T> {
Member:

Would this kind of lifetime relaxation be useful on Executor/LocalExecutor as well?

@james7132 (Contributor, author) replied Apr 12, 2024:

Perhaps. At the very minimum it would remove the need for the std::mem::transmute calls used to implement ThreadPool::scope-style APIs, which I would consider much more prone to unsoundness due to uncertainty around layout. That said, such an API is evidently not easy to use properly, and proving its soundness is even harder.

Member:

That sounds interesting; could you elaborate?

@james7132 (Contributor, author) replied Apr 13, 2024:

The primary API in question is bevy_tasks::TaskPool::scope. The actual implementation can be seen here, though it's pretty tangled due to our need to explicitly schedule tasks onto a separate executor since !Send data must be accessed and dropped from the thread it was created from.

The function is technically non-blocking since the thread yields to the executor while waiting for the tasks in the scope to complete.
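The Thread::scope-style contract described above can be illustrated with std::thread::scope: its guarantee that spawned work finishes before the borrowed data goes out of scope is the same one the unsafe spawn_scoped asks callers to uphold manually. A std-only sketch, not Bevy's actual implementation:

```rust
// `std::thread::scope` lets spawned threads borrow non-'static data
// because the scope joins every thread before returning, so the borrow
// provably outlives all the work that uses it.
fn sum_in_scope(data: &[u32]) -> u32 {
    let mut total = 0;
    std::thread::scope(|s| {
        // The closure borrows `data` (a non-'static reference); this is
        // safe only because the scope joins the thread before exiting.
        let handle = s.spawn(|| data.iter().sum::<u32>());
        total = handle.join().unwrap();
    });
    total
}

fn main() {
    assert_eq!(sum_in_scope(&[1, 2, 3]), 6);
}
```

spawn_scoped moves this "the borrow outlives the task" proof obligation from the type system to the caller, which is why it is unsafe.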

james7132 requested a review from notgull, April 12, 2024.
@yoshuawuyts (Member):

It took me a second to wrap my head around what this does conceptually, but now that I get it, I think this is an interesting idea as an optimization. I have no opinion on the actual implementation, so I'll defer to someone else for the review.

As perhaps a tiny bikeshed point to raise: when I saw LeakedExecutor it initially threw me a little, as leaking is generally undesirable, and I wasn't sure why we would want to leak an executor.

However, I think we could reframe what this does as: "it enables an executor to exist for the duration of the entire program". This is something we often call "static" (e.g. 'static, static, lazy_static!), and I wonder if the name StaticExecutor might work? It's a genuine question; I'm not yet convinced this is better per se, but I wanted to raise the option for discussion.

@fogti (Member) commented Apr 12, 2024:

Can we ensure/check that the LeakedExecutor methods are correctly inlined? (This isn't directly clear when they contain await and such.)

@notgull (Member) left a comment:

I'm a fan of StaticExecutor, but I'd still like the method to be named leak.

@notgull (Member) commented Apr 12, 2024:

> Can we ensure/check that the LeakedExecutor methods are correctly inlined? (This isn't directly clear when they contain await and such.)

I don't know how this could be checked automatically, or whether it really matters, given that the benchmarks seem faster than normal anyway.

james7132 changed the title from LeakedExecutor to Static Executors, Apr 13, 2024.
james7132 requested a review from notgull, April 14, 2024.
@notgull (Member) left a comment:

Tentative approval until the next concurrent-queue is released, but I'll wait for @fogti's review as well

@james7132 (Contributor, author) commented Apr 15, 2024:

@fogti We could desugar the user-facing versions of these functions and force a #[inline(always)] annotation on the State implementations. It definitely looks worse in the user-facing documentation, though I don't think doing so is a breaking change:

// before
pub async fn run<T>(&self, fut: impl Future<Output = T>) -> T;
// after
pub fn run<'r, T: 'r>(&'r self, future: impl Future<Output = T> + 'r) -> impl Future<Output = T> + 'r;
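A runnable, std-only sketch of this desugaring (the function names are illustrative, not this crate's code, and the minimal no-op-waker block_on exists only to drive the futures in the example):

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Before-style: an `async fn`; the compiler-generated wrapper cannot
// carry a meaningful `#[inline(always)]` in older toolchains.
async fn run_sugared(x: u32) -> u32 {
    x + 1
}

// After-style: desugared to return `impl Future`, so the thin outer
// function can be annotated `#[inline(always)]` while the async body
// remains a separate, independently optimized item.
#[inline(always)]
fn run_desugared(x: u32) -> impl Future<Output = u32> {
    async move { x + 1 }
}

// Minimal polling loop with a no-op waker, just to drive the examples.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn raw() -> RawWaker {
        fn clone(_: *const ()) -> RawWaker {
            raw()
        }
        fn noop(_: *const ()) {}
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    let waker = unsafe { Waker::from_raw(raw()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    assert_eq!(block_on(run_sugared(1)), 2);
    assert_eq!(block_on(run_desugared(1)), 2);
}
```

Both forms behave identically to callers; only the inlining hint on the outer function changes.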

@fogti (Member) commented Apr 15, 2024:

@james7132 Of course we can (and in the past I would have suggested exactly that), but it would be a good idea to check whether async functions marked #[inline(always)] actually get inlined properly on current stable Rust: just because this was a problem in the past doesn't mean it still is (as was suggested elsewhere a few months ago).

@fogti (Member) commented Apr 15, 2024:

cargo bench --all --all-features currently fails. This is a hard blocker.

@notgull (Member) commented Apr 16, 2024:

Hmm, are benchmarks not built in CI? Odd.

@fogti (Member) commented Apr 16, 2024:

Some of the benchmarks appear to be a bit too small, and some have very large variance, making them unsuitable for properly measuring performance in larger systems...

multi_thread/executor::spawn_batch
                        time:   [31.896 µs 35.826 µs 39.825 µs]
                        change: [-74.713% +27.283% +332.75%] (p = 0.84 > 0.05)
                        No change in performance detected.
perf report, top overhead
# Overhead  Command          Shared Object                    Symbol                                                                                                                                                       
# ........  ...............  ...............................  ....................................................
#
    26.19%  executor-6b7a70  executor-6b7a7073e5281258        [.] concurrent_queue::unbounded::Unbounded<T>::pop
     9.43%  executor-6b7a70  executor-6b7a7073e5281258        [.] concurrent_queue::unbounded::Unbounded<T>::push
     7.15%  executor-6b7a70  executor-6b7a7073e5281258        [.] criterion::stats::univariate::kde::Kde<A,K>::map
     5.37%  executor-6b7a70  executor-6b7a7073e5281258        [.] async_executor::Runner::runnable::{{closure}}
     4.09%  executor-6b7a70  executor-6b7a7073e5281258        [.] async_executor::State::notify
     3.73%  executor-6b7a70  executor-6b7a7073e5281258        [.] concurrent_queue::bounded::Bounded<T>::pop
     2.57%  executor-6b7a70  executor-6b7a7073e5281258        [.] std::sys::unix::locks::futex_mutex::Mutex::lock_contended
     2.33%  executor-6b7a70  executor-6b7a7073e5281258        [.] concurrent_queue::bounded::Bounded<T>::push_or_else
     2.19%  executor-6b7a70  executor-6b7a7073e5281258        [.] async_task::raw::RawTask<F,T,S,M>::run
     2.08%  executor-6b7a70  executor-6b7a7073e5281258        [.] <async_task::task::Task<T,M> as core::ops::drop::Drop>::drop
     1.90%  executor-6b7a70  executor-6b7a7073e5281258        [.] async_task::raw::RawTask<F,T,S,M>::run

@james7132 (Contributor, author):

I also noted the variance with spawn_batch, but that particular benchmark shouldn't be affected that strongly by this PR, beyond moving the try_tick/tick/run implementations to State. It's interesting to see the unbounded variants of the queue at the top, which corroborates potential gains from the previous attempt to directly enqueue onto local queues.

For a more realistic workload, I tested this against Bevy's many_foxes stress tests, and saw a 66% reduction in time spawning tasks in the system executor:

[attached image: many_foxes profiling comparison]

@notgull (Member) left a comment:

Thanks!

@notgull (Member) commented May 12, 2024:

Once this is rebased it can be merged.

james7132 requested a review from fogti, May 12, 2024.
notgull merged commit 924b453 into smol-rs:master, May 12, 2024.
8 checks passed
notgull mentioned this pull request, May 12, 2024.
Successfully merging this pull request may close: A leaked Executor does not need to track active tasks.

4 participants