Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce the size of MeshUniform to improve performance #9416

Merged
merged 7 commits into from
Aug 15, 2023

Conversation

superdump
Copy link
Contributor

Objective

  • Significantly reduce the size of MeshUniform by only including necessary data.

Solution

Local to world, model transforms are affine. This means they only need a 4x3 matrix to represent them.

MeshUniform stores the current, and previous model transforms, and the inverse transpose of the current model transform, all as 4x4 matrices. Instead we can store the current, and previous model transforms as 4x3 matrices, and we only need the upper-left 3x3 part of the inverse transpose of the current model transform. This change allows us to reduce the serialized MeshUniform size from 208 bytes to 144 bytes, which is over a 30% saving in data to serialize, and VRAM bandwidth and space.

Benchmarks

On an M1 Max, running many_cubes -- sphere, main is in yellow, this PR is in red:
Screenshot 2023-08-11 at 02 36 43
A reduction in frame time of ~14%.


Changelog

  • Changed: Redefined MeshUniform to improve performance by using 4x3 affine transforms and reconstructing 4x4 matrices in the shader. Helper functions were added to bevy_pbr::mesh_functions to unpack the data. affine_to_square converts the packed 4x3 in 3x4 matrix data to a 4x4 matrix. mat2x4_f32_to_mat3x3 converts the 3x3 in mat2x4 + f32 matrix data back into a 3x3.

Migration Guide

Shader code before:

var model = mesh[instance_index].model;

Shader code after:

#import bevy_pbr::mesh_functions affine_to_square

var model = affine_to_square(mesh[instance_index].model);

@superdump superdump requested a review from robtfm August 11, 2023 00:44
Before:
mat4x4 x3
u32
= 196 bytes, but in an array, practically 208 bytes for 16-byte alignment.

After:
mat3x4 x2
mat2x4
f32
u32
= 136 bytes, but in an array, practically 144 bytes for 16-byte alignment.

That is a reduction of over 30% VRAM space and bandwidth usage, plus less data
to serialize.
@JMS55
Copy link
Contributor

JMS55 commented Aug 11, 2023

For further saving, when motion vectors are not enabled, we should remove the previous model transform altogether.

@alice-i-cecile alice-i-cecile added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times labels Aug 11, 2023
Copy link
Member

@james7132 james7132 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real quick scan, not a full review. Great to see the memory reduction and frame time gains though.

#[derive(Component, ShaderType, Clone)]
#[derive(Component)]
pub struct MeshTransforms {
pub transform: Affine3A,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're now doing nontrivial transformation into MeshUniform already, can we try using Affine3 over Affine3A here? We could save more CPU side memory and reduce command time costs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can for sure try it to see if it performs better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no glam::Affine3. It looks like we only use the translation from the transform for view z calculations in the render app. I could then change them out for our own implementation of Affine3 that just has conversion to Affine3A implemented in case someone needs to do calculations with it.

Copy link
Contributor Author

@superdump superdump Aug 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that and retested (M1 Max, on AC power and fully charged, using my regular DPI external monitor, many_cubes -- sphere for 1500 frames).

main (yellow) vs PR (red):
frame time:
Screenshot 2023-08-12 at 12 52 55
extract_meshes system:
Screenshot 2023-08-12 at 12 53 35
extract_meshes system commands:
Screenshot 2023-08-12 at 12 53 43

main (yellow) vs PR with MeshTransforms using custom Affine3:
frame time:
Screenshot 2023-08-12 at 12 53 57
extract_meshes system:
Screenshot 2023-08-12 at 12 54 11
extract_meshes system commands:
Screenshot 2023-08-12 at 12 54 26

Summary:

  • frame times vs main in median frame time for this test run:
    • PR gives 6% reduction
    • PR + Affine3 gives 14% reduction

So it does indeed seem like adding struct Affine3 { matrix3: Mat3, translation: Vec3 } is worth it. Done.

In rendering code there is a significant performance benefit from using only
Mat3 + Vec3 instead of Mat3A + Vec3A for mesh transforms as their storage size
dominates performance.
crates/bevy_math/src/affine3.rs Outdated Show resolved Hide resolved
@@ -23,7 +23,7 @@ struct VertexOutput {
fn vertex(vertex: Vertex) -> VertexOutput {
var out: VertexOutput;
out.clip_position = mesh_position_local_to_clip(
mesh[bevy_render::instance_index::get_instance_index(vertex.instance_index)].model,
affine_to_square(mesh[bevy_render::instance_index::get_instance_index(vertex.instance_index)].model),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think perhaps we should embrace the affinity and make the 3x4 the arg to the mesh_xxx and skin_normal functions, and the return of the skin_model function. maybe not very important since its only used per vertex but it would be cleaner.

then this would be clearer/easier to understand if the mesh.model (and the mesh_xxx arg) was a mat4x3 instead of a transpose ... is there a good reason to use 3x4 instead? the packing logic in mesh.rs is already quite complex so it probably wouldn't be worse. i guess MeshUniform would contain a [f32;12] instead of a [Vec4;3].

the downside would be if a caller has a 4x4 and wants to use the 4x3 functions they would have to use mesh_xxx(mat4x3(my4x4[0].xyz, my4x4[1].xyz, my4x4[2].xyz, my4x4[3].xyz)) - i guess we could add a utility conversion function for that if there isn't a better builtin way (i don't think there is).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think perhaps we should embrace the affinity and make the 3x4 the arg to the mesh_xxx and skin_normal functions, and the return of the skin_model function. maybe not very important since its only used per vertex but it would be cleaner.

Then each of those would have to do the conversions or be less-readable code, and if anyone wanted to use the transforms for something else then they would have to figure out how.

I am rather feeling like we may want to add a get_model_matrix(instance_index) -> mat4x4<f32> function.

then this would be clearer/easier to understand if the mesh.model (and the mesh_xxx arg) was a mat4x3 instead of a transpose ... is there a good reason to use 3x4 instead? the packing logic in mesh.rs is already quite complex so it probably wouldn't be worse. i guess MeshUniform would contain a [f32;12] instead of a [Vec4;3].

4x3 takes the same space in uniform/storage buffers as 4x4 because it's practically 3 vec3s, and vec3s have an alignment of 16 so each vec3 starts at a 16-byte alignment. The shader would then error or misinterpret the data in the binding.

the downside would be if a caller has a 4x4 and wants to use the 4x3 functions they would have to use mesh_xxx(mat4x3(my4x4[0].xyz, my4x4[1].xyz, my4x4[2].xyz, my4x4[3].xyz)) - i guess we could add a utility conversion function for that if there isn't a better builtin way (i don't think there is).

Projection matrices are necessarily 4x4. so there we have to have a 4x4 matrix. One thing to note is that I have been told that the shader compiler (the GPU vendor one I guess?) will optimise away the multiplications by 0.0 and 1.0, so adding them back to make it a 4x4 does not hurt performance, it only makes code more readable and the matrix more usable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a get_model_matrix(instance_index: u32) -> mat4x4<f32> helper function and similar for previous model matrix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything makes sense given the size of mat4x3 is 64 bytes, i didn't realise that. then there's no purpose to 4x3 as input to the functions, and we would lose the size benefit if it was the unfiorm type. 👍

crates/bevy_pbr/src/render/mesh.rs Outdated Show resolved Hide resolved
crates/bevy_pbr/src/render/mesh_functions.wgsl Outdated Show resolved Hide resolved
@superdump
Copy link
Contributor Author

For further saving, when motion vectors are not enabled, we should remove the previous model transform altogether.

I tried looking into this but it adds a lot of code complexity to optionally have the field or not, so I'm choosing to just always have it.

@robtfm
Copy link
Contributor

robtfm commented Aug 13, 2023

last thing to change on the exported function in the resolved thread then all good

@superdump superdump added this pull request to the merge queue Aug 14, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 14, 2023
@superdump superdump added this pull request to the merge queue Aug 14, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 14, 2023
@superdump superdump added this pull request to the merge queue Aug 15, 2023
Merged via the queue into bevyengine:main with commit 0a11af9 Aug 15, 2023
20 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Aug 21, 2023
# Objective

- Wireframe currently don't display since #9416
- There is an error
```
2023-08-20T10:06:54.190347Z ERROR bevy_render::render_resource::pipeline_cache: failed to process shader:
error: no definition in scope for identifier: 'vertex_no_morph'
   ┌─ crates/bevy_pbr/src/render/wireframe.wgsl:26:94
   │
26 │     let model = bevy_pbr::mesh_functions::get_model_matrix(vertex_no_morph.instance_index);
   │                                                                                              ^^^^^^^^^^^^^^^ unknown identifier
   │
   = no definition in scope for identifier: 'vertex_no_morph'
```

## Solution

- Use the correct identifier
github-merge-queue bot pushed a commit that referenced this pull request Aug 21, 2023
# Objective

- Since #9416, example shader_material_glsl is broken
<img width="1280" alt="Screenshot 2023-08-20 at 21 08 41"
src="https:/bevyengine/bevy/assets/8672791/16bc15ee-a58c-4ec6-87a8-c3799a6df74a">

## Solution

- Apply the same function as other examples, but in glsl
@cart cart mentioned this pull request Oct 13, 2023
43 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants