
Use EntityHashMap<Entity, T> for render world entity storage for better performance #9903

Merged (15 commits) on Sep 27, 2023

Conversation

@superdump (Contributor) commented Sep 22, 2023

Objective

  • Improve rendering performance, particularly by avoiding the large system-commands cost of using the ECS in the way that the render world does.

Solution

  • Define EntityHasher that calculates a hash from Entity.to_bits() as i | (i.wrapping_mul(0x517cc1b727220a95) << 32). 0x517cc1b727220a95 is approximately u64::MAX divided by a value close to π, and works well for hashing. Thanks to @SkiFire13 for the suggestion and to @nicopap for alternative suggestions and discussion. This approach comes from rustc-hash (a.k.a. FxHasher) with some tweaks for the case of hashing an Entity. FxHasher and SeaHasher were also tested but were significantly slower.
  • Define an EntityHashMap type that uses the EntityHasher.
  • Use EntityHashMap<Entity, T> for render world entity storage, including:
    • RenderMaterialInstances - contains the AssetId<M> of the material associated with the entity. Also for 2D.
    • RenderMeshInstances - contains mesh transforms, flags and properties about mesh entities. Also for 2D.
    • SkinIndices and MorphIndices - contain the skin and morph indices for entities, respectively
    • ExtractedSprites
    • ExtractedUiNodes
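As a rough illustration of the hasher described above (a sketch, not the exact Bevy implementation; it uses raw u64 keys in place of Entity, which would supply the same bits via Entity::to_bits()):

```rust
use std::hash::{BuildHasherDefault, Hasher};

// Sketch of the hasher described above. It assumes the whole 64-bit key
// arrives through a single `write_u64` call, as hashing a u64 key does.
#[derive(Default)]
struct EntityHasher {
    hash: u64,
}

impl Hasher for EntityHasher {
    fn finish(&self) -> u64 {
        self.hash
    }

    fn write(&mut self, _bytes: &[u8]) {
        panic!("EntityHasher only supports u64 keys");
    }

    fn write_u64(&mut self, i: u64) {
        // Keep the low bits intact for memory locality; multiply by the
        // pi-derived constant (from rustc-hash / FxHasher) and shift the
        // product into the upper 32 bits so hashbrown's top-7-bit SIMD
        // probing has non-zero bits to work with.
        self.hash = i | (i.wrapping_mul(0x517cc1b727220a95) << 32);
    }
}

type EntityHashMap<K, V> =
    std::collections::HashMap<K, V, BuildHasherDefault<EntityHasher>>;

fn main() {
    let mut map: EntityHashMap<u64, &str> = EntityHashMap::default();
    map.insert(42, "mesh instance");
    assert_eq!(map.get(&42), Some(&"mesh instance"));
}
```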

Benchmarks

All benchmarks have been conducted on an M1 Max connected to AC power. The tests are run for 1500 frames. The 1000th frame is captured for comparison to check for visual regressions. There were none.

2D Meshes

bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d

--ordered-z

This test spawns the 2D meshes with z incrementing from back to front. This is the ideal allocation order, as it matches the sorted render order, which gives lookups a high cache hit rate.


-39.1% median frame time.

Random

This test spawns the 2D meshes with random z. This not only gives the batching and transparent 2D pass lookups a lot of cache misses; it also currently means that the meshes are almost certain not to be batchable.


-7.2% median frame time.

3D Meshes

many_cubes --benchmark


-7.7% median frame time.

Sprites

NOTE: On main, sprites use SparseSet<Entity, T>!

bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite

--ordered-z

This test spawns the sprites with z incrementing from back to front. This is the ideal allocation order, as it matches the sorted render order, which gives lookups a high cache hit rate.


+13.0% median frame time.

Random

This test spawns the sprites with random z. This makes the batching and transparent 2D pass lookups get a lot of cache misses.


+0.6% median frame time.

UI

NOTE: On main, UI uses SparseSet<Entity, T>!

many_buttons


+15.1% median frame time.

Alternatives

  • Cart originally suggested trying out SparseSet<Entity, T>, and indeed that is slightly faster under ideal conditions. However, PassHashMap<Entity, T> has better worst-case performance when data is randomly distributed rather than in sorted render order, and it avoids the worst-case memory usage of SparseSet, whose index-mapping Vec<usize> (indexed by the Entity index and pointing into the Vec<T> of values) has to be as large as the largest Entity index used with the SparseSet.
  • I also tested PassHashMap<u32, T>, intending to use Entity.index() as the key, but this proved to sometimes be slower and mostly no different.
  • The only outstanding approach that has not been implemented and tested is to not clear the render world of its entities each frame. That has its own problems, though they could perhaps be solved.
    • Performance-wise, if the entities and their component data were not cleared, then they would incur table moves on spawn, and should not thereafter, rather just their component data would be overwritten. Ideally we would have a neat way of either updating data in-place via &mut T queries, or inserting components if not present. This would likely be quite cumbersome to have to remember to do everywhere, but perhaps it only needs to be done in the more performance-sensitive systems.
    • The main problem to solve, however, is that we want to maintain a mapping between main world entities and render world entities, run the render app and world in parallel with the main app and world for pipelined rendering, and at the same time spawn entities in the render world in such a way that those Entity ids do not collide with those spawned in the main world. This is potentially quite solvable, but could well be a lot of ECS work to do in a way that makes sense.
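The memory concern with SparseSet can be illustrated with a minimal sparse-set sketch (an illustration only, not Bevy's actual SparseSet): inserting a single value at a large entity index forces the index-mapping Vec to grow to cover that index.

```rust
// Minimal sparse-set sketch for illustration (assumed structure, not
// Bevy's real implementation; repeated inserts at one index are not
// deduplicated here).
struct SparseSet<T> {
    // Indexed by entity index; must grow to the largest index used.
    sparse: Vec<Option<usize>>,
    // Densely packed values; `sparse` points into this.
    dense: Vec<T>,
}

impl<T> SparseSet<T> {
    fn new() -> Self {
        Self { sparse: Vec::new(), dense: Vec::new() }
    }

    fn insert(&mut self, index: usize, value: T) {
        if index >= self.sparse.len() {
            // This is the worst-case memory cost: one slot per possible
            // entity index up to the largest one inserted.
            self.sparse.resize(index + 1, None);
        }
        self.sparse[index] = Some(self.dense.len());
        self.dense.push(value);
    }

    fn get(&self, index: usize) -> Option<&T> {
        self.sparse.get(index).copied().flatten().map(|i| &self.dense[i])
    }
}

fn main() {
    let mut set = SparseSet::new();
    // One element stored, but the sparse Vec now holds 1_000_001 slots.
    set.insert(1_000_000, "only value");
    assert_eq!(set.get(1_000_000), Some(&"only value"));
    assert_eq!(set.sparse.len(), 1_000_001);
}
```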

Changelog

  • Changed: Component data for entities to be drawn is no longer stored on entities in the render world. Instead, data is stored in an EntityHashMap<Entity, T> in various resources. This brings significant performance benefits due to the way the render app clears entities every frame. The resources of most interest are RenderMeshInstances and RenderMaterialInstances, and their 2D counterparts.

Migration Guide

Previously the render app extracted mesh entities and their component data from the main world and stored them as entities and components in the render world. Now they are extracted into, essentially, EntityHashMap<Entity, T>, where T is a struct containing an appropriate group of data. This means that while extract-set systems will continue to run extract queries against the main world, they will store their data in hash maps. Also, systems in later sets will either need to look up entities in the available resources, such as RenderMeshInstances, or maintain their own EntityHashMap<Entity, T> for their own data.

Before:

```rust
fn queue_custom(
    material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>,
) {
    // ...
    for (entity, mesh_transforms, mesh_handle) in &material_meshes {
        // ...
    }
}
```

After:

```rust
fn queue_custom(
    render_mesh_instances: Res<RenderMeshInstances>,
    instance_entities: Query<Entity, With<InstanceMaterialData>>,
) {
    // ...
    for entity in &instance_entities {
        let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; };
        // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms`, can now
        // be found in `mesh_instance`, which is a `RenderMeshInstance`.
        // ...
    }
}
```

@superdump (Contributor, Author):

@cart just to clearly state that this PR does introduce performance regressions for sprites and UI, as it switches them from SparseSet to PassHashMap. The question is: do we want to take a best-case performance hit in exchange for better performance with random ordering and for avoiding memory overuse?

@JMS55 (Contributor) commented Sep 23, 2023

These benchmarks are likely to change once we switch the render phase sorting to be by pipeline, right?

@superdump (Contributor, Author):

> These benchmarks are likely to change once we switch the render phase sorting to be by pipeline, right?

These benchmarks in particular wouldn't change, as they're all using one pipeline each. Beyond that, we would only implement that kind of sorting for opaque phases, where draw order is about maximising performance and not about visual correctness. That means only many_cubes would be affected, as 2D and UI only have transparent phases at the moment.

Comment on lines 215 to 219
```rust
#[inline]
fn write_u32(&mut self, i: u32) {
    self.hash = i as u64;
}
```

@SkiFire13 (Contributor):

This is probably a performance footgun. hashbrown uses the top 7 bits of the hash to do SIMD probing inside a group, and this will set all of them to 0, essentially nullifying that optimization.

Also note that Entity has two u32 fields, so it will call write_u32 twice and the second one will overwrite the result of the first one. Luckily the index is the second field of Entity, but we're not documenting this anywhere and it would be easy to break (especially considering the ongoing work to refactor identifiers in general).

@superdump (Contributor, Author) commented Sep 23, 2023:

Ah, I didn't know it worked like that. What do you suggest?

The generation in the Entity doesn't really matter for these uses, so I did try using Entity.index() -> u32 as noted in the description but in some cases it was slower than using Entity for some reason. Just noting that the index is the bit I'm interested in using as a key for cases where the map is cleared each frame, in case that makes things simpler to improve.

@SkiFire13 (Contributor):

> Ah, I didn't know it worked like that. What do you suggest?

You could try a very simple hasher like rustc_hash. If you want to manually implement it for the u32 case, it's just (i as u64).wrapping_mul(0x517cc1b727220a95). The only concern I have with this is that it kinda randomizes the hash, so it pessimizes the case where entities are sorted.

An alternative for better memory locality might be keeping the entity index for the lower 32 bits while using the hash for the upper 32 bits, which should allow some useful top 7 bits. This could be implemented with ((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64).
For 32 bit targets (e.g. wasm) this is a bit more complicated because hashbrown only considers the bottom 32 bits. Unfortunately the top 7 bits of the entity index are all zeros most of the time (you need 2^25 ~ 33M entities for them to start being non-zero). My guess is that you could get away with the previous method but done on a 16-16 bits split, e.g. (i.wrapping_mul(0x9e3779b9) << 16) | (i & 0xffff)
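The two suggested bit-mixers can be sketched as plain functions (the names mix_u64 and mix_u32 are hypothetical helpers; the constants are the ones quoted above). The key property is that the untouched index survives in the low bits, preserving memory locality, while the multiplied hash fills the high bits:

```rust
/// 64-bit variant: pi-derived multiply occupies the top 32 bits, the raw
/// index occupies the bottom 32 bits (hypothetical helper name).
fn mix_u64(i: u32) -> u64 {
    ((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64)
}

/// 32-bit-target variant: golden-ratio multiply occupies the top 16 bits,
/// the raw low 16 bits of the index occupy the bottom (hypothetical name).
fn mix_u32(i: u32) -> u32 {
    (i.wrapping_mul(0x9e3779b9) << 16) | (i & 0xffff)
}

fn main() {
    // The low bits are the untouched index, so consecutively spawned
    // entities land in neighbouring buckets.
    assert_eq!(mix_u64(12345) as u32, 12345);
    assert_eq!(mix_u32(12345) & 0xffff, 12345 & 0xffff);
}
```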

> Just noting that the index is the bit I'm interested in using as a key for cases where the map is cleared each frame, in case that makes things simpler to improve.

Then using only index should be fine for your uses, but how do you ensure that whoever changes Entity in the future makes sure that the index field will be hashed last?

@cart (Member):

Yeah I don't like how implicit this is. PassHasher exists to pass a single full u64 hash. Giving it support for a u32 and then using it on each of the fields of Entity is a confusing misuse of that design imo.

Some ideas:

  1. A custom Entity Hash impl that combines the index and generation into a single u64
    • Probably very bad for some safety reason (in addition to cleanliness / maintainability), but could we transmute &self (&Entity) to &u64?
  2. A custom EntityMap type that does the work for users. (ex: if we really want an entity.index -> X map, we could just wrap an Entity api over PassHashMap<u64, X, PassHash> that casts entity.index to u64.)

@superdump (Contributor, Author):

I tried a suggestion from @nicopap about doing a wrapping_mul by u32::MAX / golden ratio. I was thinking that, as the hash is a u64, I should instead use u64::MAX / phi. But this made performance really, really bad. I'm not sure what I did wrong, and ahash and others are not as bad as that was, which seems odd. My current guess, without having had time to dig deeper, is that wrapping_mul on a u64 uses u128 maths that just has really bad performance.

@cart options 1 and 2 don’t address skifire’s comment about the upper 7 bits being used for some simd optimisation. Option 2 would always have the upper 32 bits as all zeros, and option 1 would be mostly zeros unless people do a lot of respawning of entities.

@superdump (Contributor, Author) commented Sep 26, 2023:

I removed write_u32, and implemented Hash for Entity manually by using .to_bits() which shifts the generation into the upper 32 bits. This seems to work fine, and I don't see why it would be a problem anywhere. Shout if you think of something.

I also tried FxHasher (from the rustc-hash crate) and SeaHasher, as they are claimed to be among the fastest hashers I could find, and neither is close to the speed of PassHash. As we are concerned about performance, it seems that the hashing cost of FxHasher or similar dominates versus the SIMD optimisation in hashbrown's hash map lookup. Maybe we can find something better in the future, but this seems to be fastest now. I also tried .rotate_right(7) to shift the bottom 7 bits up to being the top 7 bits. That made performance seriously tank. Not sure why.

@SkiFire13 (Contributor) commented Sep 26, 2023:

You could copy the 7 bottom bits to the top with a bitshift operation.

The problem with this approach is that the 7 top bits are used to differentiate values that end up in the same bucket (or other nearby buckets), but if they end up in the same bucket they very likely have the same bottom bits, and hence if you copy the 7 bottom bits to the 7 top bits they will be useless for distinguishing different values.

Ideally you would copy the bottom 7 bits that aren't included in the bucket mask, but this depends on the size of the hashmap, which is not a constant.

> I also tried FxHasher (from the rustc-hash crate) and SeaHasher, as they are claimed to be among the fastest hashers I could find, and neither is close to the speed of PassHash. As we are concerned about performance, it seems that the hashing cost of FxHasher or similar dominates versus the SIMD optimisation in hashbrown's hash map lookup.

Did you try using just those hashers, or my version which stores the hash in the top 32 bits only (the one with ((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64))?

If you're using the hashers from those crates you may be observing the issue with memory locality I was mentioning previously. And if this is the issue, I wonder how often it comes up in practice: it may just be a consequence of spawning a lot of entities with the same component at the same moment. Maybe the entity distribution is much more random when a lot of other systems are running, spawning/despawning entities and changing their archetypes.

Did you hash just the index, or the generation too? Hashing the generation as well takes more time (for example, rustc_hash is just a multiplication to hash a single u32/u64, but to hash a second u32 it starts requiring a rotation, an xor and another multiplication).

> I also tried .rotate_right(7) to shift the bottom 7 bits up to being the top 7 bits. That made performance seriously tank. Not sure why.

If you .rotate_right(7) then you "lose" the bottom 7 bits, which makes every chunk of 128 entities map to the same bucket, which is horrible for performance. I think @nicopap meant doing something like ((i as u64) << 57) | (i as u64), which makes the 7 top bits equal to the 7 bottom bits without losing them, but as I mentioned above this will likely be useless for the SIMD optimization.

@superdump (Contributor, Author):

I tested this PR as-is with Entity.to_bits() and PassHash, versus Skifire's suggestion of using only the index, not the generation, and calculating the hash as ((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64), and a suggestion from nicopap to try i | (i << (64 - 7)) as a cheap way of getting the low 7 bits into the high 7 bits.

Ordered z spawning

With entities spawned with ordered z (which means after sorting, the batching and main pass 2d are doing a linear scan in terms of entity spawn order):

PassHash (yellow) vs Skifire's (red) frame times: [screenshot]

PassHash (yellow) vs nicopap's (red) frame times: [screenshot]

PassHash mean time per call: [screenshot]

Skifire's mean time per call: [screenshot]

nicopap's mean time per call: [screenshot]

nicopap's median frame time is 2.7% less than the PR, and Skifire's is 1.2%.
PR: queuing: 7.43ms, batching: 7ms, main_pass_2d: 0.288ms
nicopap's: queuing: 7.22ms, batching: 6.92ms, main_pass_2d: 0.293ms
Skifire's: queuing: 7.46ms, batching: 6.94ms, main_pass_2d: 0.276ms

So, for ordered, it looks like nicopap's suggestion is marginally better.

Random z spawning

This mode spawns the entities with random z; this causes their order after sorting to be random, so the lookups in batching and main pass 2d are random.

PassHash (yellow) vs Skifire's (red) frame times: [screenshot]

PassHash (yellow) vs nicopap's (red) frame times: [screenshot]

PassHash mean time per call: [screenshot]

Skifire's mean time per call: [screenshot]

nicopap's mean time per call: [screenshot]

nicopap's median frame time is 2.7% less than the PR, and Skifire's is 3.8%.
PR: queuing: 7.46ms, batching: 25.85ms, main_pass_2d: 88.28ms
nicopap's: queuing: 7.36ms, batching: 24.39ms, main_pass_2d: 86.57ms
Skifire's: queuing: 7.44ms, batching: 23.88ms, main_pass_2d: 85.53ms

So for random z and random lookup orders, Skifire's suggestion is marginally faster.

I suspect Skifire's suggestion will be more robust in general: it has the best worst-case (random access) performance and is close on best-case (ordered access) performance, so I'm inclined to choose it.

@superdump (Contributor, Author):

@SkiFire13 how exactly does one calculate / arrive at 0x517cc1b727220a95? Following @nicopap's suggestion of using something golden-ratio based, I have seen things like u64::MAX / phi. But your constant seems to be u64::MAX / π. However, if I try to calculate it in Rust using (u64::MAX as f64 * std::f64::consts::FRAC_1_PI) as u64, or with various rounding, I can't get the same number. Was such a calculation used as a starting point, with tests then identifying this particular value as good? I want to add a comment explaining its origin.

@SkiFire13 (Contributor):

Unfortunately I don't know precisely how 0x517cc1b727220a95 was found. I just tweaked what rustc_hash does (which contains that exact constant), which in turn documents that it is a port of Firefox's internal hasher. You're right, though, that it is related to pi: it seems to be the number N for which 2^64 / N is closest to pi.
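The pi relationship can be sanity-checked numerically (a sketch; note that f64 has only 53 significant bits, which is also why recomputing the exact 64-bit constant from (u64::MAX as f64 * FRAC_1_PI) as u64 doesn't round-trip):

```rust
fn main() {
    // 2^64 divided by the constant should come out very close to pi.
    let n = 0x517cc1b727220a95u64;
    let ratio = 2f64.powi(64) / n as f64;
    assert!((ratio - std::f64::consts::PI).abs() < 1e-6);
    // The constant itself cannot be recovered exactly via f64 arithmetic:
    // near 5.87e18 the spacing between representable f64 values is ~1024,
    // so a float-based calculation lands on a nearby, different integer.
}
```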

@cart (Member) left a comment:

Generally on board for this direction. But I agree that the current Entity passhash behavior we're relying on isn't ideal.

Fixes PassHash only using Entity.index.
@nicopap (Contributor) commented Sep 26, 2023

many_foxes is broken on the current version of the PR.

@superdump (Contributor, Author):

Yup, fixing.

Cargo.toml (outdated review thread, resolved)
@nicopap (Contributor) commented Sep 26, 2023

You could take the SkinIndices and MorphIndices as parameters to the impl GetBatchData for MeshPipeline type Param and check whether they contain the entity in question. This would remove the serialization of the extract systems.

```diff
@@ -13,7 +14,7 @@ use std::{
 /// For an identifier tied to the lifetime of an asset, see [`Handle`].
 ///
 /// For an "untyped" / "generic-less" id, see [`UntypedAssetId`].
-#[derive(Reflect)]
+#[derive(Component, Reflect)]
```
@nicopap (Contributor) commented Sep 26, 2023:

This shouldn't be a Component, though. I'd prefer it newtyped:

```rust
#[derive(Component)]
struct RenderAssetId<A: Asset>(AssetId<A>);
```

Making this a component may be misleading for users: they might accidentally insert an AssetId when they mean to insert a Handle.

@superdump (Contributor, Author):

I had discussed this with Cart and we thought it should become a component. The only reason I needed to change it was because ExtractComponent is implemented for Handle<T> and is supposed to extract to a Handle<T> at the moment, and we rather thought it should extract to an AssetId<T>.

@nicopap (Contributor):

I still think it should be a newtype. I can see the disadvantages but not the advantages (apart from expediency; I'd be glad to open a PR after this one is merged if that's the only problem). Maybe there is an advantage I'm missing. Could you link me to the conversation?

@superdump (Contributor, Author):

I can't as it was in DMs on Discord. On the batching PR, Cart mentioned that AssetId<T> should be used if a weak handle is desired, such as how I was using it there. It felt appropriate to also switch to AssetId<T> as part of this PR.

I am happy to wrap AssetId<T> in a newtype, but I don't understand why I would do it. I hear that you think a user might insert an AssetId<T> when they should have used a Handle<T>, but I don't see why wrapping it in a newtype would make much difference to that.

@superdump (Contributor, Author):

You mentioned on Discord that it's easy for someone to see Handle and AssetId are both Components and think they can use either, whereas a newtype wrapper around AssetId gives a bit more friction. I decided to revert the change that made AssetId a Component and revert the change to the ExtractComponent impl for Handle. I think it's better taken in a separate PR. :) Thanks for noticing it!

@superdump (Contributor, Author):

> You could take the SkinIndices and MorphIndices as parameters to the impl GetBatchData for MeshPipeline type Param and check whether they contain the entity in question. This would remove the serialization of the extract systems.

Right, that's what I meant would have a negative performance impact on the batching because it would have to do two additional lookups per mesh instance.

crates/bevy_render/src/batching/mod.rs (three review threads, resolved)
```rust
&ViewVisibility,
&GlobalTransform,
&Mesh2dHandle,
Has<NoAutomaticBatching>,
```
Contributor:

When is NoAutomaticBatching added to a main world entity?

@superdump (Contributor, Author):

Is this rhetorical and suggesting I should document it?

Main world entities that have morphs or skins will have it added. Otherwise users can choose to add it if they want to opt out of batching.

Contributor:

For mesh2d/mesh.rs here, I don't think morphs/skins are relevant.

But what I was asking is: the idea is to make automatic batching user-controllable on a per-entity basis. But why? Isn't it a property of the specific material? If so, is a separate component justified? I mean, it's non-blocking, but worth thinking about.

@superdump (Contributor, Author):

The component was introduced in the batching PR. The point was to allow people to opt-out of automatic batching and instancing if for some reason it was causing performance issues, or if they could do their own batching/instancing in some way that they want to. It's an escape hatch for user control.

crates/bevy_sprite/src/mesh2d/mesh.rs (outdated review thread, resolved)
crates/bevy_pbr/src/render/mesh.rs (outdated review thread, resolved)
@superdump superdump changed the title Use PassHashMap<Entity, T> for render world entity storage for better performance Use EntityHashMap<Entity, T> for render world entity storage for better performance Sep 26, 2023
Merged via the queue into bevyengine:main with commit b6ead2b Sep 27, 2023
21 checks passed
@rparrett mentioned this pull request Sep 29, 2023
pcwalton added a commit to pcwalton/bevy that referenced this pull request Oct 3, 2023
…sier for

other components to adopt.

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in bevyengine#9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

This commit creates a new `ExtractToRenderInstance` trait that mirrors
the existing `ExtractComponent` method but uses the higher-performance
approach that bevyengine#9903 uses. This makes high-performance rendering
components essentially as easy to write as the existing ones based on
component extraction.
github-merge-queue bot pushed a commit that referenced this pull request Oct 6, 2023
# Objective

- After #9903, example
`mesh2d_manual` doesn't render anything

## Solution

- Fix the example using the new `RenderMesh2dInstances`
github-merge-queue bot pushed a commit that referenced this pull request Oct 8, 2023
…ther components to adopt. (#10002)

# Objective

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in #9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

## Solution

This commit creates a new `RenderInstance` trait that mirrors the
existing `ExtractComponent` method but uses the higher-performance
approach that #9903 uses. Additionally, `RenderInstance` is more
flexible than `ExtractComponent`, because it can extract multiple
components at once. This makes high-performance rendering components
essentially as easy to write as the existing ones based on component
extraction.

---

## Changelog

### Added

A new `RenderInstance` trait is available mirroring `ExtractComponent`,
but using a higher-performance method to extract one or more components
to the render world. If you have custom components that rendering takes
into account, you may consider migration from `ExtractComponent` to
`RenderInstance` for higher performance.
@cart mentioned this pull request Oct 13, 2023
regnarock pushed a commit to regnarock/bevy that referenced this pull request Oct 13, 2023
# Objective

- After bevyengine#9903, example
`mesh2d_manual` doesn't render anything

## Solution

- Fix the example using the new `RenderMesh2dInstances`
regnarock pushed a commit to regnarock/bevy that referenced this pull request Oct 13, 2023
…sier for other components to adopt. (bevyengine#10002)

# Objective

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in bevyengine#9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

## Solution

This commit creates a new `RenderInstance` trait that mirrors the
existing `ExtractComponent` method but uses the higher-performance
approach that bevyengine#9903 uses. Additionally, `RenderInstance` is more
flexible than `ExtractComponent`, because it can extract multiple
components at once. This makes high-performance rendering components
essentially as easy to write as the existing ones based on component
extraction.

---

## Changelog

### Added

A new `RenderInstance` trait is available mirroring `ExtractComponent`,
but using a higher-performance method to extract one or more components
to the render world. If you have custom components that rendering takes
into account, you may consider migration from `ExtractComponent` to
`RenderInstance` for higher performance.
ameknite pushed a commit to ameknite/bevy that referenced this pull request Nov 6, 2023
# Objective

- After bevyengine#9903, example
`mesh2d_manual` doesn't render anything

## Solution

- Fix the example using the new `RenderMesh2dInstances`
ameknite pushed a commit to ameknite/bevy that referenced this pull request Nov 6, 2023
…sier for other components to adopt. (bevyengine#10002)

# Objective

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in bevyengine#9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

## Solution

This commit creates a new `RenderInstance` trait that mirrors the
existing `ExtractComponent` method but uses the higher-performance
approach that bevyengine#9903 uses. Additionally, `RenderInstance` is more
flexible than `ExtractComponent`, because it can extract multiple
components at once. This makes high-performance rendering components
essentially as easy to write as the existing ones based on component
extraction.

---

## Changelog

### Added

A new `RenderInstance` trait is available mirroring `ExtractComponent`,
but using a higher-performance method to extract one or more components
to the render world. If you have custom components that rendering takes
into account, you may consider migration from `ExtractComponent` to
`RenderInstance` for higher performance.
github-merge-queue bot pushed a commit that referenced this pull request Nov 9, 2023
# Objective

Related to #10472.

Not having a hardcoded scale factor makes comparing results from these
stress tests difficult.

Contributors using high dpi screens may be rendering 4x as many pixels
as others (or more). Stress tests may have different behavior when moved
from one monitor in a dual setup to another. At very high resolutions,
different parts of the engine / hardware are being stressed.

1080p is also a far more common resolution for gaming.

## Solution

Use a consistent 1080p with `scale_factor_override: 1.0` everywhere.

In #9903, this sort of change was added specifically to `bevymark` and
`many_cubes` but it makes sense to do it everywhere.

## Discussion

- Maybe we should have a command line option, environment variable, or
`CI_TESTING_CONFIG` option for 1080p / 1440p / 4k.

- Will these look odd (small text?) when screenshotted and shown in the
example showcase? The aspect ratio is the same, but they will be
downscaled from 1080p instead of ~720p.

- Maybe there are other window properties that should be consistent
across stress tests. e.g. `resizable: false`.

- Should we add a `stress_test_window(title)` helper or something?

- Bevymark (pre-10472) was intentionally 800x600 to match "bunnymark", I
believe. I don't personally think this is very important.
Shatur added a commit to projectharmonia/bevy_replicon that referenced this pull request Nov 24, 2023
Shatur added a commit to projectharmonia/bevy_replicon that referenced this pull request Nov 24, 2023
Shatur added a commit to projectharmonia/bevy_replicon that referenced this pull request Nov 26, 2023
github-merge-queue bot pushed a commit that referenced this pull request Nov 28, 2023
# Objective

Keep essentially the same structure of `EntityHasher` from #9903, but
rephrase the multiplication slightly to save an instruction.

cc @superdump 
Discord thread:
https://discord.com/channels/691052431525675048/1172033156845674507/1174969772522356756

## Solution

Today, the hash is
```rust
        self.hash = i | (i.wrapping_mul(FRAC_U64MAX_PI) << 32);
```
with `i` being `(generation << 32) | index`.

Expanding things out, we get
```rust
i | ( (i * CONST) << 32 )
= (generation << 32) | index | ((((generation << 32) | index) * CONST) << 32)
= (generation << 32) | index | ((index * CONST) << 32)  // because the generation overflowed
= (index * CONST | generation) << 32 | index
```

What if we do the same thing, but with `+` instead of `|`? That's almost
the same thing, except that it has carries, which are actually often
better in a hash function anyway, since it doesn't saturate. (`|` can be
dangerous, since once something becomes `-1` it'll stay that way, and
there's no mixing available.)

```rust
(index * CONST + generation) << 32 + index
= (CONST << 32 + 1) * index + generation << 32
= (CONST << 32 + 1) * index + (WHATEVER << 32 + generation) << 32 // because the extra overflows and thus can be anything
= (CONST << 32 + 1) * index + ((CONST * generation) << 32 + generation) << 32 // pick "whatever" to be something convenient
= (CONST << 32 + 1) * index + ((CONST << 32 + 1) * generation) << 32
= (CONST << 32 + 1) * index + (CONST << 32 + 1) * (generation << 32)
= (CONST << 32 + 1) * (index + generation << 32)
= (CONST << 32 + 1) * (generation << 32 | index)
= (CONST << 32 + 1) * i
```

So we can do essentially the same thing using a single multiplication
instead of doing multiply-shift-or.
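The algebra above can be spot-checked with a standalone snippet (not Bevy's actual source): for any `u64` input, the `+` variant of the shift-based hash equals a single wrapping multiply by `(CONST << 32) + 1`, because `(i * CONST) << 32` and `i * (CONST << 32)` agree modulo 2^64:

```rust
const FRAC_U64MAX_PI: u64 = 0x517cc1b727220a95;

// The `+` variant of the original hash: i + ((i * CONST) << 32).
fn hash_shift_add(i: u64) -> u64 {
    i.wrapping_add(i.wrapping_mul(FRAC_U64MAX_PI) << 32)
}

// The fused form: one wrapping multiply by (CONST << 32) + 1.
fn hash_single_mul(i: u64) -> u64 {
    i.wrapping_mul((FRAC_U64MAX_PI << 32).wrapping_add(1))
}

fn main() {
    // Sample inputs covering index-only, generation-only, and mixed bits.
    for &i in &[0u64, 1, 42, u32::MAX as u64, (7u64 << 32) | 123, u64::MAX] {
        assert_eq!(hash_shift_add(i), hash_single_mul(i));
    }
    println!("fused form matches the shift-add form for all sampled inputs");
}
```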

LLVM was already smart enough to merge the shifting into a
multiplication, but this saves the extra `or`:

![image](https:/bevyengine/bevy/assets/18526288/d9396614-2326-4730-abbe-4908c01b5ace)
<https://rust.godbolt.org/z/MEvbz4eo4>

It's a very small change, and often will disappear in load latency
anyway, but it's a couple percent faster in lookups:

![image](https:/bevyengine/bevy/assets/18526288/c365ec85-6adc-4f6d-8fa6-a65146f55a75)

(There was more of an improvement here before #10558, but with `to_bits`
being a single `qword` load now, keeping things mostly as it is turned
out to be better than the bigger changes I'd tried in #10605.)

---

## Changelog

(Probably skip it)

## Migration Guide

(none needed)
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
…er performance (bevyengine#9903)

# Objective

- Improve rendering performance, particularly by avoiding the large
system commands costs of using the ECS in the way that the render world
does.

## Solution

- Define `EntityHasher` that calculates a hash from the
`Entity.to_bits()` by `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)`.
`0x517cc1b727220a95` is something like `u64::MAX / N` for N that gives a
value close to π and that works well for hashing. Thanks to @SkiFire13
for the suggestion and to @nicopap for alternative suggestions and
discussion. This approach comes from `rustc-hash` (a.k.a. `FxHasher`)
with some tweaks for the case of hashing an `Entity`. `FxHasher` and
`SeaHasher` were also tested but were significantly slower.
- Define an `EntityHashMap` type that uses the `EntityHasher`
- Use `EntityHashMap<Entity, T>` for render world entity storage,
including:
- `RenderMaterialInstances` - contains the `AssetId<M>` of the material
associated with the entity. Also for 2D.
- `RenderMeshInstances` - contains mesh transforms, flags and properties
about mesh entities. Also for 2D.
- `SkinIndices` and `MorphIndices` - contain the skin and morph index
for an entity, respectively
  - `ExtractedSprites`
  - `ExtractedUiNodes`
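The hasher and map type described above can be sketched as a self-contained snippet (the key is simplified here to the raw `u64` from `Entity::to_bits()`; Bevy's actual definitions differ in detail):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Sketch of the hasher described above, keyed on the raw u64 bits.
#[derive(Default)]
struct EntityHasher {
    hash: u64,
}

impl Hasher for EntityHasher {
    fn write(&mut self, _bytes: &[u8]) {
        panic!("EntityHasher only hashes u64 keys");
    }

    // `u64::hash` routes through `write_u64`, so this is the only path
    // exercised when the map key is a u64.
    fn write_u64(&mut self, i: u64) {
        self.hash = i | (i.wrapping_mul(0x517cc1b727220a95) << 32);
    }

    fn finish(&self) -> u64 {
        self.hash
    }
}

// A std HashMap wired up with the cheap entity-specific hasher.
type EntityHashMap<K, V> = HashMap<K, V, BuildHasherDefault<EntityHasher>>;

fn main() {
    let mut instances: EntityHashMap<u64, &str> = EntityHashMap::default();
    instances.insert(42, "mesh instance data");
    assert_eq!(instances.get(&42), Some(&"mesh instance data"));
    println!("lookup ok");
}
```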

## Benchmarks

All benchmarks have been conducted on an M1 Max connected to AC power.
The tests are run for 1500 frames. The 1000th frame is captured for
comparison to check for visual regressions. There were none.

### 2D Meshes

`bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d`

#### `--ordered-z`

This test spawns the 2D meshes with z incrementing back to front, which
is the ideal allocation order as it matches the sorted render order,
meaning lookups have a high cache hit rate.

<img width="1112" alt="Screenshot 2023-09-27 at 07 50 45"
src="https:/bevyengine/bevy/assets/302146/e140bc98-7091-4a3b-8ae1-ab75d16d2ccb">

-39.1% median frame time.

#### Random

This test spawns the 2D meshes with random z. This not only makes the
batching and transparent 2D pass lookups get a lot of cache misses, it
also currently means that the meshes are almost certain to not be
batchable.

<img width="1108" alt="Screenshot 2023-09-27 at 07 51 28"
src="https:/bevyengine/bevy/assets/302146/29c2e813-645a-43ce-982a-55df4bf7d8c4">

-7.2% median frame time.

### 3D Meshes

`many_cubes --benchmark`

<img width="1112" alt="Screenshot 2023-09-27 at 07 51 57"
src="https:/bevyengine/bevy/assets/302146/1a729673-3254-4e2a-9072-55e27c69f0fc">

-7.7% median frame time.

### Sprites

**NOTE: On `main` sprites are using `SparseSet<Entity, T>`!**

`bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite`

#### `--ordered-z`

This test spawns the sprites with z incrementing back to front, which is
the ideal allocation order as it matches the sorted render order,
meaning lookups have a high cache hit rate.

<img width="1116" alt="Screenshot 2023-09-27 at 07 52 31"
src="https:/bevyengine/bevy/assets/302146/bc8eab90-e375-4d31-b5cd-f55f6f59ab67">

+13.0% median frame time.

#### Random

This test spawns the sprites with random z. This makes the batching and
transparent 2D pass lookups get a lot of cache misses.

<img width="1109" alt="Screenshot 2023-09-27 at 07 53 01"
src="https:/bevyengine/bevy/assets/302146/22073f5d-99a7-49b0-9584-d3ac3eac3033">

+0.6% median frame time.

### UI

**NOTE: On `main` UI is using `SparseSet<Entity, T>`!**

`many_buttons`

<img width="1111" alt="Screenshot 2023-09-27 at 07 53 26"
src="https:/bevyengine/bevy/assets/302146/66afd56d-cbe4-49e7-8b64-2f28f6043d85">

+15.1% median frame time.

## Alternatives

- Cart originally suggested trying out `SparseSet<Entity, T>` and indeed
that is slightly faster under ideal conditions. However,
`EntityHashMap<Entity, T>` has better worst-case performance when data is
randomly distributed, rather than in sorted render order, and does not
have the worst-case memory usage of `SparseSet`'s dense `Vec<usize>`,
which maps from the `Entity` index to a sparse index into `Vec<T>`. This
dense `Vec` has to be as large as the largest `Entity` index used with the
`SparseSet`.
- I also tested `EntityHashMap<u32, T>`, intending to use `Entity.index()`
as the key, but this proved to sometimes be slower and mostly no
different.
- The only outstanding approach that has not been implemented and tested
is to _not_ clear the render world of its entities each frame. That has
its own problems, though they could perhaps be solved.
- Performance-wise, if the entities and their component data were not
cleared, then they would incur table moves on spawn, and should not
thereafter, rather just their component data would be overwritten.
Ideally we would have a neat way of either updating data in-place via
`&mut T` queries, or inserting components if not present. This would
likely be quite cumbersome to have to remember to do everywhere, but
perhaps it only needs to be done in the more performance-sensitive
systems.
- The main problem to solve however is that we want to both maintain a
mapping between main world entities and render world entities, be able
to run the render app and world in parallel with the main app and world
for pipelined rendering, and at the same time be able to spawn entities
in the render world in such a way that those Entity ids do not collide
with those spawned in the main world. This is potentially quite
solvable, but could well be a lot of ECS work to do it in a way that
makes sense.
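The `SparseSet` worst-case memory point in the first bullet above can be illustrated with a toy sketch (not Bevy's `SparseSet`): inserting a single entry at a large entity index forces the sparse index `Vec` to span every index up to it, while a hash map's footprint scales with the number of entries:

```rust
use std::collections::HashMap;

// Toy sparse set, illustrative only: `sparse` maps entity index ->
// position in `dense`, with usize::MAX marking an empty slot.
struct ToySparseSet<T> {
    sparse: Vec<usize>,
    dense: Vec<T>,
}

impl<T> ToySparseSet<T> {
    fn new() -> Self {
        Self { sparse: Vec::new(), dense: Vec::new() }
    }

    fn insert(&mut self, index: usize, value: T) {
        if index >= self.sparse.len() {
            // The sparse Vec must grow to cover the largest index used.
            self.sparse.resize(index + 1, usize::MAX);
        }
        self.sparse[index] = self.dense.len();
        self.dense.push(value);
    }

    fn get(&self, index: usize) -> Option<&T> {
        self.sparse
            .get(index)
            .copied()
            .filter(|&slot| slot != usize::MAX)
            .map(|slot| &self.dense[slot])
    }
}

fn main() {
    let mut set = ToySparseSet::new();
    set.insert(1_000_000, "one far-away entity");
    // One entry, but the sparse Vec now holds 1_000_001 slots.
    assert_eq!(set.sparse.len(), 1_000_001);
    assert_eq!(set.get(1_000_000), Some(&"one far-away entity"));

    // A hash map storing the same entry allocates for one entry only.
    let mut map = HashMap::new();
    map.insert(1_000_000usize, "one far-away entity");
    assert_eq!(map.len(), 1);
}
```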

---

## Changelog

- Changed: Component data for entities to be drawn are no longer stored
on entities in the render world. Instead, data is stored in an
`EntityHashMap<Entity, T>` in various resources. This brings significant
performance benefits due to the way the render app clears entities every
frame. Resources of most interest are `RenderMeshInstances` and
`RenderMaterialInstances`, and their 2D counterparts.

## Migration Guide

Previously the render app extracted mesh entities and their component
data from the main world and stored them as entities and components in
the render world. Now they are extracted into essentially
`EntityHashMap<Entity, T>` where each `T` is a struct containing an
appropriate group of data. This means that while extract set systems
will continue to run extract queries against the main world they will
store their data in hash maps. Also, systems in later sets will either
need to look up entities in the available resources such as
`RenderMeshInstances`, or maintain their own `EntityHashMap<Entity, T>`
for their own data.

Before:
```rust
fn queue_custom(
    material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>,
) {
    ...
    for (entity, mesh_transforms, mesh_handle) in &material_meshes {
        ...
    }
}
```

After:
```rust
fn queue_custom(
    render_mesh_instances: Res<RenderMeshInstances>,
    instance_entities: Query<Entity, With<InstanceMaterialData>>,
) {
    ...
    for entity in &instance_entities {
        let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; };
        // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms` can now
        // be found in `mesh_instance` which is a `RenderMeshInstance`
        ...
    }
}
```

---------

Co-authored-by: robtfm <[email protected]>
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
# Objective

- After bevyengine#9903, example
`mesh2d_manual` doesn't render anything

## Solution

- Fix the example using the new `RenderMesh2dInstances`
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
…sier for other components to adopt. (bevyengine#10002)

# Objective

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in bevyengine#9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

## Solution

This commit creates a new `RenderInstance` trait that mirrors the
existing `ExtractComponent` method but uses the higher-performance
approach that bevyengine#9903 uses. Additionally, `RenderInstance` is more
flexible than `ExtractComponent`, because it can extract multiple
components at once. This makes high-performance rendering components
essentially as easy to write as the existing ones based on component
extraction.

---

## Changelog

### Added

A new `RenderInstance` trait is available mirroring `ExtractComponent`,
but using a higher-performance method to extract one or more components
to the render world. If you have custom components that rendering takes
into account, you may consider migrating from `ExtractComponent` to
`RenderInstance` for higher performance.
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
…ine#10474)

# Objective

Related to bevyengine#10472.

Not having a hardcoded scale factor makes comparing results from these
stress tests difficult.

Contributors using high dpi screens may be rendering 4x as many pixels
as others (or more). Stress tests may have different behavior when moved
from one monitor in a dual setup to another. At very high resolutions,
different parts of the engine / hardware are being stressed.

1080p is also a far more common resolution for gaming.

## Solution

Use a consistent 1080p with `scale_factor_override: 1.0` everywhere.

In bevyengine#9903, this sort of change was added specifically to `bevymark` and
`many_cubes` but it makes sense to do it everywhere.

## Discussion

- Maybe we should have a command line option, environment variable, or
`CI_TESTING_CONFIG` option for 1080p / 1440p / 4k.

- Will these look odd (small text?) when screenshotted and shown in the
example showcase? The aspect ratio is the same, but they will be
downscaled from 1080p instead of ~720p.

- Maybe there are other window properties that should be consistent
across stress tests. e.g. `resizable: false`.

- Should we add a `stress_test_window(title)` helper or something?

- Bevymark (pre-10472) was intentionally 800x600 to match "bunnymark", I
believe. I don't personally think this is very important.
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it

6 participants