
Use EntityHashMap<Entity, T> for render world entity storage for better performance #9903

Merged (15 commits) on Sep 27, 2023

Conversation

@superdump (Contributor) commented Sep 22, 2023

Objective

  • Improve rendering performance, particularly by avoiding the large system-commands cost of using the ECS in the way that the render world does.

Solution

  • Define EntityHasher that calculates a hash from Entity.to_bits() as i | (i.wrapping_mul(0x517cc1b727220a95) << 32). 0x517cc1b727220a95 is approximately u64::MAX divided by a value close to π, and works well for hashing. Thanks to @SkiFire13 for the suggestion and to @nicopap for alternative suggestions and discussion. This approach comes from rustc-hash (a.k.a. FxHasher) with some tweaks for the case of hashing an Entity. FxHasher and SeaHasher were also tested but were significantly slower.
  • Define an EntityHashMap type that uses the EntityHasher.
  • Use EntityHashMap<Entity, T> for render world entity storage, including:
    • RenderMaterialInstances - contains the AssetId<M> of the material associated with the entity. Also for 2D.
    • RenderMeshInstances - contains mesh transforms, flags and properties about mesh entities. Also for 2D.
    • SkinIndices and MorphIndices - contain the skin and morph indices for entities, respectively
    • ExtractedSprites
    • ExtractedUiNodes
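As a rough illustration of the hasher described above (a sketch, not the exact Bevy implementation; it uses raw u64 keys in place of Entity, which would supply the same bits via Entity::to_bits()):

```rust
use std::hash::{BuildHasherDefault, Hasher};

// Sketch of the hasher described above. It assumes the whole 64-bit key
// arrives through a single `write_u64` call, as hashing a u64 key does.
#[derive(Default)]
struct EntityHasher {
    hash: u64,
}

impl Hasher for EntityHasher {
    fn finish(&self) -> u64 {
        self.hash
    }

    fn write(&mut self, _bytes: &[u8]) {
        panic!("EntityHasher only supports u64 keys");
    }

    fn write_u64(&mut self, i: u64) {
        // Keep the low bits intact for memory locality; multiply by the
        // pi-derived constant (from rustc-hash / FxHasher) and shift the
        // product into the upper 32 bits so hashbrown's top-7-bit SIMD
        // probing has non-zero bits to work with.
        self.hash = i | (i.wrapping_mul(0x517cc1b727220a95) << 32);
    }
}

type EntityHashMap<K, V> =
    std::collections::HashMap<K, V, BuildHasherDefault<EntityHasher>>;

fn main() {
    let mut map: EntityHashMap<u64, &str> = EntityHashMap::default();
    map.insert(42, "mesh instance");
    assert_eq!(map.get(&42), Some(&"mesh instance"));
}
```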

Benchmarks

All benchmarks have been conducted on an M1 Max connected to AC power. The tests are run for 1500 frames. The 1000th frame is captured for comparison to check for visual regressions. There were none.

2D Meshes

bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d

--ordered-z

This test spawns the 2D meshes with z incrementing from back to front. This is the ideal allocation order, as it matches the sorted render order, which gives lookups a high cache hit rate.


-39.1% median frame time.

Random

This test spawns the 2D meshes with random z. This not only gives the batching and transparent 2D pass lookups a lot of cache misses; it also currently means that the meshes are almost certain not to be batchable.


-7.2% median frame time.

3D Meshes

many_cubes --benchmark


-7.7% median frame time.

Sprites

NOTE: On main, sprites use SparseSet<Entity, T>!

bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite

--ordered-z

This test spawns the sprites with z incrementing from back to front. This is the ideal allocation order, as it matches the sorted render order, which gives lookups a high cache hit rate.


+13.0% median frame time.

Random

This test spawns the sprites with random z. This makes the batching and transparent 2D pass lookups get a lot of cache misses.


+0.6% median frame time.

UI

NOTE: On main, UI uses SparseSet<Entity, T>!

many_buttons


+15.1% median frame time.

Alternatives

  • Cart originally suggested trying out SparseSet<Entity, T>, and indeed that is slightly faster under ideal conditions. However, PassHashMap<Entity, T> has better worst-case performance when data is randomly distributed rather than in sorted render order, and it avoids the worst-case memory usage of SparseSet, whose index-mapping Vec<usize> (indexed by the Entity index and pointing into the Vec<T> of values) has to be as large as the largest Entity index used with the SparseSet.
  • I also tested PassHashMap<u32, T>, intending to use Entity.index() as the key, but this proved to sometimes be slower and mostly no different.
  • The only outstanding approach that has not been implemented and tested is to not clear the render world of its entities each frame. That has its own problems, though they could perhaps be solved.
    • Performance-wise, if the entities and their component data were not cleared, then they would incur table moves on spawn, and should not thereafter, rather just their component data would be overwritten. Ideally we would have a neat way of either updating data in-place via &mut T queries, or inserting components if not present. This would likely be quite cumbersome to have to remember to do everywhere, but perhaps it only needs to be done in the more performance-sensitive systems.
    • The main problem to solve, however, is that we want to maintain a mapping between main world entities and render world entities, run the render app and world in parallel with the main app and world for pipelined rendering, and at the same time spawn entities in the render world in such a way that those Entity ids do not collide with those spawned in the main world. This is potentially quite solvable, but could well be a lot of ECS work to do in a way that makes sense.
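The memory concern with SparseSet can be illustrated with a minimal sparse-set sketch (an illustration only, not Bevy's actual SparseSet): inserting a single value at a large entity index forces the index-mapping Vec to grow to cover that index.

```rust
// Minimal sparse-set sketch for illustration (assumed structure, not
// Bevy's real implementation; repeated inserts at one index are not
// deduplicated here).
struct SparseSet<T> {
    // Indexed by entity index; must grow to the largest index used.
    sparse: Vec<Option<usize>>,
    // Densely packed values; `sparse` points into this.
    dense: Vec<T>,
}

impl<T> SparseSet<T> {
    fn new() -> Self {
        Self { sparse: Vec::new(), dense: Vec::new() }
    }

    fn insert(&mut self, index: usize, value: T) {
        if index >= self.sparse.len() {
            // This is the worst-case memory cost: one slot per possible
            // entity index up to the largest one inserted.
            self.sparse.resize(index + 1, None);
        }
        self.sparse[index] = Some(self.dense.len());
        self.dense.push(value);
    }

    fn get(&self, index: usize) -> Option<&T> {
        self.sparse.get(index).copied().flatten().map(|i| &self.dense[i])
    }
}

fn main() {
    let mut set = SparseSet::new();
    // One element stored, but the sparse Vec now holds 1_000_001 slots.
    set.insert(1_000_000, "only value");
    assert_eq!(set.get(1_000_000), Some(&"only value"));
    assert_eq!(set.sparse.len(), 1_000_001);
}
```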

Changelog

  • Changed: Component data for entities to be drawn is no longer stored on entities in the render world. Instead, data is stored in an EntityHashMap<Entity, T> in various resources. This brings significant performance benefits due to the way the render app clears entities every frame. The resources of most interest are RenderMeshInstances and RenderMaterialInstances, and their 2D counterparts.

Migration Guide

Previously the render app extracted mesh entities and their component data from the main world and stored them as entities and components in the render world. Now they are extracted into, essentially, EntityHashMap<Entity, T>, where T is a struct containing an appropriate group of data. This means that while extract-set systems will continue to run extract queries against the main world, they will store their data in hash maps. Also, systems in later sets will either need to look up entities in the available resources, such as RenderMeshInstances, or maintain their own EntityHashMap<Entity, T> for their own data.

Before:

```rust
fn queue_custom(
    material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>,
) {
    // ...
    for (entity, mesh_transforms, mesh_handle) in &material_meshes {
        // ...
    }
}
```

After:

```rust
fn queue_custom(
    render_mesh_instances: Res<RenderMeshInstances>,
    instance_entities: Query<Entity, With<InstanceMaterialData>>,
) {
    // ...
    for entity in &instance_entities {
        let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; };
        // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms`, can now
        // be found in `mesh_instance`, which is a `RenderMeshInstance`.
        // ...
    }
}
```

@superdump (Contributor, Author):

@cart just to clearly state that this PR does introduce performance regressions for sprites and UI, as it switches them from SparseSet to PassHashMap. The question is: do we want to take a best-case performance hit in exchange for better performance with random ordering and for avoiding memory overuse?

@JMS55 (Contributor) commented Sep 23, 2023

These benchmarks are likely to change once we switch the render phase sorting to be by pipeline, right?

@superdump (Contributor, Author):

> These benchmarks are likely to change once we switch the render phase sorting to be by pipeline, right?

These benchmarks in particular wouldn't change, as they're all using one pipeline each. Beyond that, we would only implement that kind of sorting for opaque phases, where draw order is about maximising performance and not about visual correctness. That means only many_cubes would be affected, as 2D and UI only have transparent phases at the moment.

Comment on lines 215 to 219
```rust
#[inline]
fn write_u32(&mut self, i: u32) {
    self.hash = i as u64;
}
```

@SkiFire13 (Contributor):

This is probably a performance footgun. hashbrown uses the top 7 bits of the hash to do SIMD probing inside a group, and this will set all of them to 0, essentially nullifying that optimization.

Also note that Entity has two u32 fields, so it will call write_u32 twice and the second one will overwrite the result of the first one. Luckily the index is the second field of Entity, but we're not documenting this anywhere and it would be easy to break (especially considering the ongoing work to refactor identifiers in general).

@superdump (Contributor, Author) commented Sep 23, 2023:

Ah, I didn't know it worked like that. What do you suggest?

The generation in the Entity doesn't really matter for these uses, so I did try using Entity.index() -> u32 as noted in the description but in some cases it was slower than using Entity for some reason. Just noting that the index is the bit I'm interested in using as a key for cases where the map is cleared each frame, in case that makes things simpler to improve.

@SkiFire13 (Contributor):

> Ah, I didn't know it worked like that. What do you suggest?

You could try a very simple hasher like rustc_hash. If you want to manually implement it for the u32 case, it's just (i as u64).wrapping_mul(0x517cc1b727220a95). The only concern I have with this is that it kinda randomizes the hash, so it pessimizes the case where entities are sorted.

An alternative for better memory locality might be keeping the entity index for the lower 32 bits while using the hash for the upper 32 bits, which should allow some useful top 7 bits. This could be implemented with ((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64).
For 32 bit targets (e.g. wasm) this is a bit more complicated because hashbrown only considers the bottom 32 bits. Unfortunately the top 7 bits of the entity index are all zeros most of the time (you need 2^25 ~ 33M entities for them to start being non-zero). My guess is that you could get away with the previous method but done on a 16-16 bits split, e.g. (i.wrapping_mul(0x9e3779b9) << 16) | (i & 0xffff)
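The two suggested bit-mixers can be sketched as plain functions (the names mix_u64 and mix_u32 are hypothetical helpers; the constants are the ones quoted above). The key property is that the untouched index survives in the low bits, preserving memory locality, while the multiplied hash fills the high bits:

```rust
/// 64-bit variant: pi-derived multiply occupies the top 32 bits, the raw
/// index occupies the bottom 32 bits (hypothetical helper name).
fn mix_u64(i: u32) -> u64 {
    ((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64)
}

/// 32-bit-target variant: golden-ratio multiply occupies the top 16 bits,
/// the raw low 16 bits of the index occupy the bottom (hypothetical name).
fn mix_u32(i: u32) -> u32 {
    (i.wrapping_mul(0x9e3779b9) << 16) | (i & 0xffff)
}

fn main() {
    // The low bits are the untouched index, so consecutively spawned
    // entities land in neighbouring buckets.
    assert_eq!(mix_u64(12345) as u32, 12345);
    assert_eq!(mix_u32(12345) & 0xffff, 12345 & 0xffff);
}
```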

> Just noting that the index is the bit I'm interested in using as a key for cases where the map is cleared each frame, in case that makes things simpler to improve.

Then using only index should be fine for your uses, but how do you ensure that whoever changes Entity in the future makes sure that the index field will be hashed last?

@cart (Member):

Yeah I don't like how implicit this is. PassHasher exists to pass a single full u64 hash. Giving it support for a u32 and then using it on each of the fields of Entity is a confusing misuse of that design imo.

Some ideas:

  1. A custom Entity Hash impl that combines the index and generation into a single u64
    • Probably very bad for some safety reason (in addition to cleanliness / maintainability), but could we transmute &self (&Entity) to &u64?
  2. A custom EntityMap type that does the work for users. (ex: if we really want an entity.index -> X map, we could just wrap an Entity api over PassHashMap<u64, X, PassHash> that casts entity.index to u64.)

@superdump (Contributor, Author):

I tried a suggestion from @nicopap about doing a wrapping_mul by u32::MAX / golden ratio. I was thinking that, as the hash is a u64, I should instead use u64::MAX / phi. But this made performance really, really bad. I'm not sure what I did wrong, and ahash and others are not as bad as that was, which seems odd. My current guess, without having had time to dig deeper, is that wrapping_mul on a u64 uses u128 maths that just has really bad performance.

@cart options 1 and 2 don’t address skifire’s comment about the upper 7 bits being used for some simd optimisation. Option 2 would always have the upper 32 bits as all zeros, and option 1 would be mostly zeros unless people do a lot of respawning of entities.

@superdump (Contributor, Author) commented Sep 26, 2023:

I removed write_u32, and implemented Hash for Entity manually by using .to_bits() which shifts the generation into the upper 32 bits. This seems to work fine, and I don't see why it would be a problem anywhere. Shout if you think of something.

I also tried FxHasher (from the rustc-hash crate) and SeaHasher, as they are claimed to be among the fastest hashers I could find, and neither is close to the speed of PassHash. As we are concerned about performance, it seems that the hashing cost of FxHasher or similar dominates versus the SIMD optimisation in hashbrown's hash map lookup. Maybe we can find something better in the future, but this seems to be fastest now. I also tried .rotate_right(7) to shift the bottom 7 bits up to being the top 7 bits. That made performance seriously tank. Not sure why.

@SkiFire13 (Contributor) commented Sep 26, 2023:

You could copy the 7 bottom bits to the top with a bitshift operation.

The problem with this approach is that the 7 top bits are used to differentiate values that end up in the same bucket (or other nearby buckets), but if they end up in the same bucket they very likely have the same bottom bits, and hence if you copy the 7 bottom bits to the 7 top bits they will be useless for distinguishing different values.

Ideally you would copy the bottom 7 bits that aren't included in the bucket mask, but this depends on the size of the hashmap, which is not a constant.

> I also tried FxHasher (from the rustc-hash crate) and SeaHasher, as they are claimed to be among the fastest hashers I could find, and neither is close to the speed of PassHash. As we are concerned about performance, it seems that the hashing cost of FxHasher or similar dominates versus the SIMD optimisation in hashbrown's hash map lookup.

Did you try using just those hashers, or my version which stores the hash in the top 32 bits only (the one with ((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64))?

If you're using the hashers from those crates you may be observing the issue with memory locality I was mentioning previously. And if this is the issue, I wonder how often it comes up in practice: it may just be a consequence of spawning a lot of entities with the same component at the same moment. Maybe the entity distribution is much more random when a lot of other systems are running, spawning/despawning entities and changing their archetypes.

Did you hash just the index, or the generation too? Hashing the generation as well takes more time (for example, rustc_hash is just a multiplication to hash a single u32/u64, but to hash a second u32 it starts requiring a rotation, an xor and another multiplication).

> I also tried .rotate_right(7) to shift the bottom 7 bits up to being the top 7 bits. That made performance seriously tank. Not sure why.

If you .rotate_right(7) then you "lose" the bottom 7 bits, which makes every chunk of 128 entities map to the same bucket, which is horrible for performance. I think @nicopap meant doing something like ((i as u64) << 57) | (i as u64), which makes the 7 top bits equal to the 7 bottom bits without losing them, but as I mentioned above this will likely be useless for the SIMD optimization.

@superdump (Contributor, Author):

I tested this PR as-is with Entity.to_bits() and PassHash, versus Skifire's suggestion of using only the index, not the generation, and calculating the hash as ((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64), and a suggestion from nicopap to try i | (i << (64 - 7)) as a cheap way of getting the low 7 bits into the high 7 bits.

Ordered z spawning

With entities spawned with ordered z (which means after sorting, the batching and main pass 2d are doing a linear scan in terms of entity spawn order):

PassHash (yellow) vs Skifire's (red) frame times: [screenshot]

PassHash (yellow) vs nicopap's (red) frame times: [screenshot]

PassHash mean time per call: [screenshot]

Skifire's mean time per call: [screenshot]

nicopap's mean time per call: [screenshot]

nicopap's median frame time is 2.7% less than the PR, and Skifire's is 1.2%.
PR: queuing: 7.43ms, batching: 7ms, main_pass_2d: 0.288ms
nicopap's: queuing: 7.22ms, batching: 6.92ms, main_pass_2d: 0.293ms
Skifire's: queuing: 7.46ms, batching: 6.94ms, main_pass_2d: 0.276ms

So, for ordered, it looks like nicopap's suggestion is marginally better.

Random z spawning

This mode spawns the entities with random z; this causes their order after sorting to be random, so the lookups in batching and main pass 2d are random.

PassHash (yellow) vs Skifire's (red) frame times: [screenshot]

PassHash (yellow) vs nicopap's (red) frame times: [screenshot]

PassHash mean time per call: [screenshot]

Skifire's mean time per call: [screenshot]

nicopap's mean time per call: [screenshot]

nicopap's median frame time is 2.7% less than the PR, and Skifire's is 3.8%.
PR: queuing: 7.46ms, batching: 25.85ms, main_pass_2d: 88.28ms
nicopap's: queuing: 7.36ms, batching: 24.39ms, main_pass_2d: 86.57ms
Skifire's: queuing: 7.44ms, batching: 23.88ms, main_pass_2d: 85.53ms

So for random z and random lookup orders, Skifire's suggestion is marginally faster.

I suspect Skifire's suggestion will be more robust in general: it has the best worst-case (random access) performance and is close on best-case (ordered access) performance, so I'm inclined to choose it.

@superdump (Contributor, Author):

@SkiFire13 how exactly does one calculate / arrive at 0x517cc1b727220a95? Following @nicopap's suggestion of using something golden-ratio based, I have seen things like u64::MAX / phi. But your constant seems to be u64::MAX / π. However, if I try to calculate it in Rust using (u64::MAX as f64 * std::f64::consts::FRAC_1_PI) as u64, or with various rounding, I can't get the same number. Was such a calculation used as a starting point, with tests then identifying this particular value as good? I want to add a comment explaining its origin.

@SkiFire13 (Contributor):

Unfortunately I don't know precisely how 0x517cc1b727220a95 was found. I just tweaked what rustc_hash does (which contains that exact constant), which in turn documents that it is a port of Firefox's internal hasher. You're right, though, that it is related to pi: it seems to be the number N for which 2^64 / N is closest to pi.
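The pi relationship can be sanity-checked numerically (a sketch; note that f64 has only 53 significant bits, which is also why recomputing the exact 64-bit constant from (u64::MAX as f64 * FRAC_1_PI) as u64 doesn't round-trip):

```rust
fn main() {
    // 2^64 divided by the constant should come out very close to pi.
    let n = 0x517cc1b727220a95u64;
    let ratio = 2f64.powi(64) / n as f64;
    assert!((ratio - std::f64::consts::PI).abs() < 1e-6);
    // The constant itself cannot be recovered exactly via f64 arithmetic:
    // near 5.87e18 the spacing between representable f64 values is ~1024,
    // so a float-based calculation lands on a nearby, different integer.
}
```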

@cart (Member) left a comment:

Generally on board for this direction. But I agree that the current Entity passhash behavior we're relying on isn't ideal.

Fixes PassHash only using Entity.index.
@nicopap (Contributor) commented Sep 26, 2023

many_foxes is broken on the current version of the PR.

@superdump (Contributor, Author):

Yup, fixing.

Cargo.toml (outdated review thread, resolved)
@nicopap (Contributor) commented Sep 26, 2023

You could take the SkinIndices and MorphIndices as parameters to the impl GetBatchData for MeshPipeline type Param and check whether they contain the entity in question. This would remove the serialization of the extract systems.

```diff
@@ -13,7 +14,7 @@ use std::{
 /// For an identifier tied to the lifetime of an asset, see [`Handle`].
 ///
 /// For an "untyped" / "generic-less" id, see [`UntypedAssetId`].
-#[derive(Reflect)]
+#[derive(Component, Reflect)]
```
@nicopap (Contributor) commented Sep 26, 2023:

This shouldn't be a Component, though. I'd prefer it newtyped:

```rust
#[derive(Component)]
struct RenderAssetId<A: Asset>(AssetId<A>);
```

Making this a component may be misleading for users: they might accidentally insert an AssetId when they mean to insert a Handle.

@superdump (Contributor, Author):

I had discussed this with Cart and we thought it should become a component. The only reason I needed to change it was because ExtractComponent is implemented for Handle<T> and is supposed to extract to a Handle<T> at the moment, and we rather thought it should extract to an AssetId<T>.

@nicopap (Contributor):

I still think it should be a newtype. I can see the disadvantages but not the advantages (apart from expediency; I'd be glad to open a PR after this one is merged if that's the only problem). Maybe there is an advantage I'm missing. Could you link me to the conversation?

@superdump (Contributor, Author):

I can't as it was in DMs on Discord. On the batching PR, Cart mentioned that AssetId<T> should be used if a weak handle is desired, such as how I was using it there. It felt appropriate to also switch to AssetId<T> as part of this PR.

I am happy to wrap AssetId<T> in a newtype, but I don't understand why I would do it. I hear that you think a user might insert an AssetId<T> when they should have used a Handle<T>, but I don't see why wrapping it in a newtype would make much difference to that.

@superdump (Contributor, Author):

You mentioned on Discord that it's easy for someone to see Handle and AssetId are both Components and think they can use either, whereas a newtype wrapper around AssetId gives a bit more friction. I decided to revert the change that made AssetId a Component and revert the change to the ExtractComponent impl for Handle. I think it's better taken in a separate PR. :) Thanks for noticing it!

@superdump (Contributor, Author):

> You could take the SkinIndices and MorphIndices as parameters to the impl GetBatchData for MeshPipeline type Param and check whether they contain the entity in question. This would remove the serialization of the extract systems.

Right, that's what I meant would have a negative performance impact on the batching because it would have to do two additional lookups per mesh instance.

crates/bevy_render/src/batching/mod.rs (three review threads, resolved)
```rust
&ViewVisibility,
&GlobalTransform,
&Mesh2dHandle,
Has<NoAutomaticBatching>,
```
Contributor:

When is NoAutomaticBatching added to a main world entity?

@superdump (Contributor, Author):

Is this rhetorical and suggesting I should document it?

Main world entities that have morphs or skins will have it added. Otherwise users can choose to add it if they want to opt out of batching.

Contributor:

For mesh2d/mesh.rs here, I don't think morphs/skins are relevant.

But what I was asking is: the idea is to make automatic batching user-controllable on a per-entity basis. But why? Isn't it a property of the specific material? If so, is a separate component justified? I mean, it's non-blocking, but worth thinking about.

@superdump (Contributor, Author):

The component was introduced in the batching PR. The point was to allow people to opt-out of automatic batching and instancing if for some reason it was causing performance issues, or if they could do their own batching/instancing in some way that they want to. It's an escape hatch for user control.

crates/bevy_sprite/src/mesh2d/mesh.rs (outdated review thread, resolved)
crates/bevy_pbr/src/render/mesh.rs (outdated review thread, resolved)
@superdump superdump changed the title Use PassHashMap<Entity, T> for render world entity storage for better performance Use EntityHashMap<Entity, T> for render world entity storage for better performance Sep 26, 2023
Merged via the queue into bevyengine:main with commit b6ead2b Sep 27, 2023
21 checks passed
@rparrett mentioned this pull request Sep 29, 2023
pcwalton added a commit to pcwalton/bevy that referenced this pull request Oct 3, 2023
…sier for

other components to adopt.

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in bevyengine#9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

This commit creates a new `ExtractToRenderInstance` trait that mirrors
the existing `ExtractComponent` method but uses the higher-performance
approach that bevyengine#9903 uses. This makes high-performance rendering
components essentially as easy to write as the existing ones based on
component extraction.
github-merge-queue bot pushed a commit that referenced this pull request Oct 6, 2023
# Objective

- After #9903, example
`mesh2d_manual` doesn't render anything

## Solution

- Fix the example using the new `RenderMesh2dInstances`
github-merge-queue bot pushed a commit that referenced this pull request Oct 8, 2023
…ther components to adopt. (#10002)

# Objective

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in #9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

## Solution

This commit creates a new `RenderInstance` trait that mirrors the
existing `ExtractComponent` method but uses the higher-performance
approach that #9903 uses. Additionally, `RenderInstance` is more
flexible than `ExtractComponent`, because it can extract multiple
components at once. This makes high-performance rendering components
essentially as easy to write as the existing ones based on component
extraction.

---

## Changelog

### Added

A new `RenderInstance` trait is available mirroring `ExtractComponent`,
but using a higher-performance method to extract one or more components
to the render world. If you have custom components that rendering takes
into account, you may consider migration from `ExtractComponent` to
`RenderInstance` for higher performance.
@cart mentioned this pull request Oct 13, 2023
regnarock pushed a commit to regnarock/bevy that referenced this pull request Oct 13, 2023
# Objective

- After bevyengine#9903, example
`mesh2d_manual` doesn't render anything

## Solution

- Fix the example using the new `RenderMesh2dInstances`
regnarock pushed a commit to regnarock/bevy that referenced this pull request Oct 13, 2023
…sier for other components to adopt. (bevyengine#10002)

# Objective

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in bevyengine#9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

## Solution

This commit creates a new `RenderInstance` trait that mirrors the
existing `ExtractComponent` method but uses the higher-performance
approach that bevyengine#9903 uses. Additionally, `RenderInstance` is more
flexible than `ExtractComponent`, because it can extract multiple
components at once. This makes high-performance rendering components
essentially as easy to write as the existing ones based on component
extraction.

---

## Changelog

### Added

A new `RenderInstance` trait is available mirroring `ExtractComponent`,
but using a higher-performance method to extract one or more components
to the render world. If you have custom components that rendering takes
into account, you may consider migration from `ExtractComponent` to
`RenderInstance` for higher performance.
ameknite pushed a commit to ameknite/bevy that referenced this pull request Nov 6, 2023
# Objective

- After bevyengine#9903, example
`mesh2d_manual` doesn't render anything

## Solution

- Fix the example using the new `RenderMesh2dInstances`
ameknite pushed a commit to ameknite/bevy that referenced this pull request Nov 6, 2023
…sier for other components to adopt. (bevyengine#10002)

# Objective

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in bevyengine#9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

## Solution

This commit creates a new `RenderInstance` trait that mirrors the
existing `ExtractComponent` method but uses the higher-performance
approach that bevyengine#9903 uses. Additionally, `RenderInstance` is more
flexible than `ExtractComponent`, because it can extract multiple
components at once. This makes high-performance rendering components
essentially as easy to write as the existing ones based on component
extraction.

---

## Changelog

### Added

A new `RenderInstance` trait is available mirroring `ExtractComponent`,
but using a higher-performance method to extract one or more components
to the render world. If you have custom components that rendering takes
into account, you may consider migration from `ExtractComponent` to
`RenderInstance` for higher performance.
github-merge-queue bot pushed a commit that referenced this pull request Nov 9, 2023
# Objective

Related to #10472.

Not having a hardcoded scale factor makes comparing results from these
stress tests difficult.

Contributors using high dpi screens may be rendering 4x as many pixels
as others (or more). Stress tests may have different behavior when moved
from one monitor in a dual setup to another. At very high resolutions,
different parts of the engine / hardware are being stressed.

1080p is also a far more common resolution for gaming.

## Solution

Use a consistent 1080p with `scale_factor_override: 1.0` everywhere.

In #9903, this sort of change was added specifically to `bevymark` and
`many_cubes` but it makes sense to do it everywhere.

## Discussion

- Maybe we should have a command line option, environment variable, or
`CI_TESTING_CONFIG` option for 1080p / 1440p / 4k.

- Will these look odd (small text?) when screenshotted and shown in the
example showcase? The aspect ratio is the same, but they will be
downscaled from 1080p instead of ~720p.

- Maybe there are other window properties that should be consistent
across stress tests. e.g. `resizable: false`.

- Should we add a `stress_test_window(title)` helper or something?

- Bevymark (pre-10472) was intentionally 800x600 to match "bunnymark", I
believe. I don't personally think this is very important.
Shatur added a commit to projectharmonia/bevy_replicon that referenced this pull request Nov 24, 2023
Shatur added a commit to projectharmonia/bevy_replicon that referenced this pull request Nov 24, 2023
Shatur added a commit to projectharmonia/bevy_replicon that referenced this pull request Nov 26, 2023
github-merge-queue bot pushed a commit that referenced this pull request Nov 28, 2023
# Objective

Keep essentially the same structure of `EntityHasher` from #9903, but
rephrase the multiplication slightly to save an instruction.

cc @superdump 
Discord thread:
https://discord.com/channels/691052431525675048/1172033156845674507/1174969772522356756

## Solution

Today, the hash is
```rust
        self.hash = i | (i.wrapping_mul(FRAC_U64MAX_PI) << 32);
```
with `i` being `(generation << 32) | index`.

Expanding things out, we get
```rust
i | ( (i * CONST) << 32 )
= (generation << 32) | index | ((((generation << 32) | index) * CONST) << 32)
= (generation << 32) | index | ((index * CONST) << 32)  // because the generation overflowed
= (index * CONST | generation) << 32 | index
```

What if we do the same thing, but with `+` instead of `|`? That's almost
the same thing, except that it has carries, which are actually often
better in a hash function anyway, since it doesn't saturate. (`|` can be
dangerous, since once something becomes `-1` it'll stay that way, and
there's no mixing available.)

```rust
(index * CONST + generation) << 32 + index
= (CONST << 32 + 1) * index + generation << 32
= (CONST << 32 + 1) * index + (WHATEVER << 32 + generation) << 32 // because the extra overflows and thus can be anything
= (CONST << 32 + 1) * index + ((CONST * generation) << 32 + generation) << 32 // pick "whatever" to be something convenient
= (CONST << 32 + 1) * index + ((CONST << 32 + 1) * generation) << 32
= (CONST << 32 + 1) * index + (CONST << 32 + 1) * (generation << 32)
= (CONST << 32 + 1) * (index + generation << 32)
= (CONST << 32 + 1) * (generation << 32 | index)
= (CONST << 32 + 1) * i
```

So we can do essentially the same thing using a single multiplication
instead of doing multiply-shift-or.
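The algebra above can be spot-checked with a standalone snippet (not Bevy's actual source): for any `u64` input, the `+` variant of the shift-based hash equals a single wrapping multiply by `(CONST << 32) + 1`, because `(i * CONST) << 32` and `i * (CONST << 32)` agree modulo 2^64:

```rust
const FRAC_U64MAX_PI: u64 = 0x517cc1b727220a95;

// The `+` variant of the original hash: i + ((i * CONST) << 32).
fn hash_shift_add(i: u64) -> u64 {
    i.wrapping_add(i.wrapping_mul(FRAC_U64MAX_PI) << 32)
}

// The fused form: one wrapping multiply by (CONST << 32) + 1.
fn hash_single_mul(i: u64) -> u64 {
    i.wrapping_mul((FRAC_U64MAX_PI << 32).wrapping_add(1))
}

fn main() {
    // Sample inputs covering index-only, generation-only, and mixed bits.
    for &i in &[0u64, 1, 42, u32::MAX as u64, (7u64 << 32) | 123, u64::MAX] {
        assert_eq!(hash_shift_add(i), hash_single_mul(i));
    }
    println!("fused form matches the shift-add form for all sampled inputs");
}
```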

LLVM was already smart enough to merge the shifting into a
multiplication, but this saves the extra `or`:

![image](https:/bevyengine/bevy/assets/18526288/d9396614-2326-4730-abbe-4908c01b5ace)
<https://rust.godbolt.org/z/MEvbz4eo4>

It's a very small change, and often will disappear in load latency
anyway, but it's a couple percent faster in lookups:

![image](https:/bevyengine/bevy/assets/18526288/c365ec85-6adc-4f6d-8fa6-a65146f55a75)

(There was more of an improvement here before #10558, but with `to_bits`
being a single `qword` load now, keeping things mostly as it is turned
out to be better than the bigger changes I'd tried in #10605.)

---

## Changelog

(Probably skip it)

## Migration Guide

(none needed)
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
…er performance (bevyengine#9903)

# Objective

- Improve rendering performance, particularly by avoiding the large
system commands costs of using the ECS in the way that the render world
does.

## Solution

- Define `EntityHasher` that calculates a hash from the
`Entity.to_bits()` by `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)`.
`0x517cc1b727220a95` is something like `u64::MAX / N` for N that gives a
value close to π and that works well for hashing. Thanks to @SkiFire13
for the suggestion and to @nicopap for alternative suggestions and
discussion. This approach comes from `rustc-hash` (a.k.a. `FxHasher`)
with some tweaks for the case of hashing an `Entity`. `FxHasher` and
`SeaHasher` were also tested but were significantly slower.
- Define an `EntityHashMap` type that uses the `EntityHasher`
- Use `EntityHashMap<Entity, T>` for render world entity storage,
including:
- `RenderMaterialInstances` - contains the `AssetId<M>` of the material
associated with the entity. Also for 2D.
- `RenderMeshInstances` - contains mesh transforms, flags and properties
about mesh entities. Also for 2D.
- `SkinIndices` and `MorphIndices` - contain the skin and morph index
for an entity, respectively
  - `ExtractedSprites`
  - `ExtractedUiNodes`
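The hasher and map type described above can be sketched as a self-contained snippet (the key is simplified here to the raw `u64` from `Entity::to_bits()`; Bevy's actual definitions differ in detail):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Sketch of the hasher described above, keyed on the raw u64 bits.
#[derive(Default)]
struct EntityHasher {
    hash: u64,
}

impl Hasher for EntityHasher {
    fn write(&mut self, _bytes: &[u8]) {
        panic!("EntityHasher only hashes u64 keys");
    }

    // `u64::hash` routes through `write_u64`, so this is the only path
    // exercised when the map key is a u64.
    fn write_u64(&mut self, i: u64) {
        self.hash = i | (i.wrapping_mul(0x517cc1b727220a95) << 32);
    }

    fn finish(&self) -> u64 {
        self.hash
    }
}

// A std HashMap wired up with the cheap entity-specific hasher.
type EntityHashMap<K, V> = HashMap<K, V, BuildHasherDefault<EntityHasher>>;

fn main() {
    let mut instances: EntityHashMap<u64, &str> = EntityHashMap::default();
    instances.insert(42, "mesh instance data");
    assert_eq!(instances.get(&42), Some(&"mesh instance data"));
    println!("lookup ok");
}
```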

## Benchmarks

All benchmarks have been conducted on an M1 Max connected to AC power.
The tests are run for 1500 frames. The 1000th frame is captured for
comparison to check for visual regressions. There were none.

### 2D Meshes

`bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d`

#### `--ordered-z`

This test spawns the 2D meshes with z incrementing back to front, which
is the ideal allocation order as it matches the sorted render order,
meaning lookups have a high cache hit rate.

<img width="1112" alt="Screenshot 2023-09-27 at 07 50 45"
src="https:/bevyengine/bevy/assets/302146/e140bc98-7091-4a3b-8ae1-ab75d16d2ccb">

-39.1% median frame time.

#### Random

This test spawns the 2D meshes with random z. This not only makes the
batching and transparent 2D pass lookups get a lot of cache misses, it
also currently means that the meshes are almost certain to not be
batchable.

<img width="1108" alt="Screenshot 2023-09-27 at 07 51 28"
src="https:/bevyengine/bevy/assets/302146/29c2e813-645a-43ce-982a-55df4bf7d8c4">

-7.2% median frame time.

### 3D Meshes

`many_cubes --benchmark`

<img width="1112" alt="Screenshot 2023-09-27 at 07 51 57"
src="https:/bevyengine/bevy/assets/302146/1a729673-3254-4e2a-9072-55e27c69f0fc">

-7.7% median frame time.

### Sprites

**NOTE: On `main` sprites are using `SparseSet<Entity, T>`!**

`bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite`

#### `--ordered-z`

This test spawns the sprites with z incrementing back to front, which is
the ideal allocation order as it matches the sorted render order,
meaning lookups have a high cache hit rate.

<img width="1116" alt="Screenshot 2023-09-27 at 07 52 31"
src="https:/bevyengine/bevy/assets/302146/bc8eab90-e375-4d31-b5cd-f55f6f59ab67">

+13.0% median frame time.

#### Random

This test spawns the sprites with random z. This makes the batching and
transparent 2D pass lookups get a lot of cache misses.

<img width="1109" alt="Screenshot 2023-09-27 at 07 53 01"
src="https:/bevyengine/bevy/assets/302146/22073f5d-99a7-49b0-9584-d3ac3eac3033">

+0.6% median frame time.

### UI

**NOTE: On `main` UI is using `SparseSet<Entity, T>`!**

`many_buttons`

<img width="1111" alt="Screenshot 2023-09-27 at 07 53 26"
src="https:/bevyengine/bevy/assets/302146/66afd56d-cbe4-49e7-8b64-2f28f6043d85">

+15.1% median frame time.

## Alternatives

- Cart originally suggested trying out `SparseSet<Entity, T>` and indeed
that is slightly faster under ideal conditions. However,
`EntityHashMap<Entity, T>` has better worst-case performance when data is
randomly distributed, rather than in sorted render order, and does not
have the worst-case memory usage of `SparseSet`'s dense `Vec<usize>`,
which maps from the `Entity` index to a sparse index into `Vec<T>`. This
dense `Vec` has to be as large as the largest `Entity` index used with the
`SparseSet`.
- I also tested `EntityHashMap<u32, T>`, intending to use `Entity.index()`
as the key, but this proved to sometimes be slower and mostly no
different.
- The only outstanding approach that has not been implemented and tested
is to _not_ clear the render world of its entities each frame. That has
its own problems, though they could perhaps be solved.
- Performance-wise, if the entities and their component data were not
cleared, then they would incur table moves on spawn, and should not
thereafter, rather just their component data would be overwritten.
Ideally we would have a neat way of either updating data in-place via
`&mut T` queries, or inserting components if not present. This would
likely be quite cumbersome to have to remember to do everywhere, but
perhaps it only needs to be done in the more performance-sensitive
systems.
- The main problem to solve however is that we want to both maintain a
mapping between main world entities and render world entities, be able
to run the render app and world in parallel with the main app and world
for pipelined rendering, and at the same time be able to spawn entities
in the render world in such a way that those Entity ids do not collide
with those spawned in the main world. This is potentially quite
solvable, but could well be a lot of ECS work to do it in a way that
makes sense.
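The `SparseSet` worst-case memory point in the first bullet above can be illustrated with a toy sketch (not Bevy's `SparseSet`): inserting a single entry at a large entity index forces the sparse index `Vec` to span every index up to it, while a hash map's footprint scales with the number of entries:

```rust
use std::collections::HashMap;

// Toy sparse set, illustrative only: `sparse` maps entity index ->
// position in `dense`, with usize::MAX marking an empty slot.
struct ToySparseSet<T> {
    sparse: Vec<usize>,
    dense: Vec<T>,
}

impl<T> ToySparseSet<T> {
    fn new() -> Self {
        Self { sparse: Vec::new(), dense: Vec::new() }
    }

    fn insert(&mut self, index: usize, value: T) {
        if index >= self.sparse.len() {
            // The sparse Vec must grow to cover the largest index used.
            self.sparse.resize(index + 1, usize::MAX);
        }
        self.sparse[index] = self.dense.len();
        self.dense.push(value);
    }

    fn get(&self, index: usize) -> Option<&T> {
        self.sparse
            .get(index)
            .copied()
            .filter(|&slot| slot != usize::MAX)
            .map(|slot| &self.dense[slot])
    }
}

fn main() {
    let mut set = ToySparseSet::new();
    set.insert(1_000_000, "one far-away entity");
    // One entry, but the sparse Vec now holds 1_000_001 slots.
    assert_eq!(set.sparse.len(), 1_000_001);
    assert_eq!(set.get(1_000_000), Some(&"one far-away entity"));

    // A hash map storing the same entry allocates for one entry only.
    let mut map = HashMap::new();
    map.insert(1_000_000usize, "one far-away entity");
    assert_eq!(map.len(), 1);
}
```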

---

## Changelog

- Changed: Component data for entities to be drawn are no longer stored
on entities in the render world. Instead, data is stored in an
`EntityHashMap<Entity, T>` in various resources. This brings significant
performance benefits due to the way the render app clears entities every
frame. Resources of most interest are `RenderMeshInstances` and
`RenderMaterialInstances`, and their 2D counterparts.

## Migration Guide

Previously the render app extracted mesh entities and their component
data from the main world and stored them as entities and components in
the render world. Now they are extracted into essentially
`EntityHashMap<Entity, T>` where each `T` is a struct containing an
appropriate group of data. This means that while extract set systems
will continue to run extract queries against the main world they will
store their data in hash maps. Also, systems in later sets will either
need to look up entities in the available resources such as
`RenderMeshInstances`, or maintain their own `EntityHashMap<Entity, T>`
for their own data.

Before:
```rust
fn queue_custom(
    material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>,
) {
    ...
    for (entity, mesh_transforms, mesh_handle) in &material_meshes {
        ...
    }
}
```

After:
```rust
fn queue_custom(
    render_mesh_instances: Res<RenderMeshInstances>,
    instance_entities: Query<Entity, With<InstanceMaterialData>>,
) {
    ...
    for entity in &instance_entities {
        let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; };
        // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms` can now
        // be found in `mesh_instance` which is a `RenderMeshInstance`
        ...
    }
}
```

---------

Co-authored-by: robtfm <[email protected]>
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
# Objective

- After bevyengine#9903, example
`mesh2d_manual` doesn't render anything

## Solution

- Fix the example using the new `RenderMesh2dInstances`
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
…sier for other components to adopt. (bevyengine#10002)

# Objective

Currently, the only way for custom components that participate in
rendering to opt into the higher-performance extraction method in bevyengine#9903
is to implement the `RenderInstances` data structure and the extraction
logic manually. This is inconvenient compared to the `ExtractComponent`
API.

## Solution

This commit creates a new `RenderInstance` trait that mirrors the
existing `ExtractComponent` method but uses the higher-performance
approach that bevyengine#9903 uses. Additionally, `RenderInstance` is more
flexible than `ExtractComponent`, because it can extract multiple
components at once. This makes high-performance rendering components
essentially as easy to write as the existing ones based on component
extraction.

---

## Changelog

### Added

A new `RenderInstance` trait is available mirroring `ExtractComponent`,
but using a higher-performance method to extract one or more components
to the render world. If you have custom components that rendering takes
into account, you may consider migrating from `ExtractComponent` to
`RenderInstance` for higher performance.
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
…ine#10474)

# Objective

Related to bevyengine#10472.

Not having a hardcoded scale factor makes comparing results from these
stress tests difficult.

Contributors using high dpi screens may be rendering 4x as many pixels
as others (or more). Stress tests may have different behavior when moved
from one monitor in a dual setup to another. At very high resolutions,
different parts of the engine / hardware are being stressed.

1080p is also a far more common resolution for gaming.

## Solution

Use a consistent 1080p with `scale_factor_override: 1.0` everywhere.

In bevyengine#9903, this sort of change was added specifically to `bevymark` and
`many_cubes` but it makes sense to do it everywhere.

## Discussion

- Maybe we should have a command line option, environment variable, or
`CI_TESTING_CONFIG` option for 1080p / 1440p / 4k.

- Will these look odd (small text?) when screenshotted and shown in the
example showcase? The aspect ratio is the same, but they will be
downscaled from 1080p instead of ~720p.

- Maybe there are other window properties that should be consistent
across stress tests. e.g. `resizable: false`.

- Should we add a `stress_test_window(title)` helper or something?

- Bevymark (pre-10472) was intentionally 800x600 to match "bunnymark", I
believe. I don't personally think this is very important.
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it

6 participants