
New fmt::Arguments representation. #115129

Closed
wants to merge 7 commits

Conversation

m-ou-se
Member

@m-ou-se m-ou-se commented Aug 23, 2023

This changes the internal representation of fmt::Arguments to reduce its size and to make it possible to convert a &'static str to a fmt::Arguments without additional indirection.

Part of #99012

fmt::Arguments stores three slices:

  1. The slice of string literals between the placeholders.
  2. The details of the placeholders (format options, which argument to display there, etc.).
  3. The arguments.

The second is omitted when the placeholders are simply all of the arguments, once each, in order, with default options.

Before this change, fmt::Arguments stored each of these as a &[] (or an Option of one), for a total size of six words/pointers.

However, we don't need to store the length of each slice separately: we can (unsafely) assume that the indices in a fmt::Arguments generated by format_args!() are never out of bounds, and that the number of placeholders and the number of string pieces differ by at most one.

This PR changes the struct to store three pointers without length, and a single 'number of parts' counter that is the sum of the number of placeholders and string pieces.

Additionally, this PR adds a special case for 1 part (that is, one string piece and no placeholders) to store the string pointer directly instead of through a slice of length 1.

This makes it possible for fmt::Arguments::new_str(&'static str) to exist, which unlike before (new_const(&[&'static str])) doesn't need to borrow anything other than the string itself.
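To make the word counts concrete, here is a minimal layout sketch, with all type and field names invented for illustration (this is not the actual PR code), contrasting the six-word layout of three slices with the proposed four-word layout of three thin pointers plus a combined part counter:

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

// Stand-ins for rt::Placeholder and rt::Argument; their contents don't matter for layout.
struct Placeholder;
struct Argument;

// Roughly the pre-PR layout: three fat slice references, six words total.
// (Option<&[T]> is niche-optimized, so the Option costs no extra word.)
struct OldArguments<'a> {
    pieces: &'a [&'static str],
    fmt: Option<&'a [Placeholder]>,
    args: &'a [Argument],
}

// Roughly the proposed layout: three thin pointers plus one combined
// "number of parts" counter, four words total.
struct NewArguments {
    num_parts: NonZeroUsize,
    strings: *const &'static str,
    placeholders: *const Placeholder,
    args: *const Argument,
}

fn main() {
    let word = size_of::<usize>();
    assert_eq!(size_of::<OldArguments<'static>>(), 6 * word);
    assert_eq!(size_of::<NewArguments>(), 4 * word);
    println!("old: 6 words, new: 4 words");
}
```

The one-part special case then only needs the `strings` pointer to point directly at a single `&'static str`, which is what enables a borrow-free `new_str`.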

@m-ou-se m-ou-se added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. A-fmt Area: `std::fmt` labels Aug 23, 2023
@m-ou-se m-ou-se self-assigned this Aug 23, 2023
@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 23, 2023

@m-ou-se
Member Author

m-ou-se commented Aug 23, 2023

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Aug 23, 2023
@bors
Contributor

bors commented Aug 23, 2023

⌛ Trying commit 6091d4dbd9e0cae96c550bd08ed860d3c76df120 with merge 0650fbf1bf480ad163fba30602ccdf6013aacd39...

@bors
Contributor

bors commented Aug 23, 2023

☀️ Try build successful - checks-actions
Build commit: 0650fbf1bf480ad163fba30602ccdf6013aacd39 (0650fbf1bf480ad163fba30602ccdf6013aacd39)


@rust-timer
Collaborator

Finished benchmarking commit (0650fbf1bf480ad163fba30602ccdf6013aacd39): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 1.0% | [0.3%, 4.5%] | 10 |
| Regressions ❌ (secondary) | 0.3% | [0.2%, 0.6%] | 4 |
| Improvements ✅ (primary) | -0.5% | [-1.6%, -0.2%] | 40 |
| Improvements ✅ (secondary) | -0.7% | [-1.0%, -0.2%] | 19 |
| All ❌✅ (primary) | -0.2% | [-1.6%, 4.5%] | 50 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 4.9% | [0.2%, 11.8%] | 7 |
| Regressions ❌ (secondary) | 7.4% | [2.7%, 13.2%] | 6 |
| Improvements ✅ (primary) | -3.6% | [-5.2%, -0.5%] | 6 |
| Improvements ✅ (secondary) | -2.8% | [-3.8%, -1.8%] | 2 |
| All ❌✅ (primary) | 1.0% | [-5.2%, 11.8%] | 13 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 2.2% | [1.4%, 3.3%] | 3 |
| Regressions ❌ (secondary) | 1.0% | [1.0%, 1.0%] | 1 |
| Improvements ✅ (primary) | -2.5% | [-5.0%, -0.8%] | 27 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -2.0% | [-5.0%, 3.3%] | 30 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 0.5% | [0.1%, 1.2%] | 67 |
| Regressions ❌ (secondary) | 0.7% | [0.2%, 1.3%] | 11 |
| Improvements ✅ (primary) | -0.2% | [-1.4%, -0.0%] | 27 |
| Improvements ✅ (secondary) | -0.4% | [-1.3%, -0.1%] | 8 |
| All ❌✅ (primary) | 0.3% | [-1.4%, 1.2%] | 94 |

Bootstrap: 635.819s -> 633.195s (-0.41%)
Artifact size: 347.15 MiB -> 346.99 MiB (-0.05%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Aug 24, 2023
@m-ou-se
Member Author

m-ou-se commented Aug 24, 2023

That looks pretty good, but I'm a bit surprised to see a few things like helloworld incr-unchanged go up (by 0.55%).

@m-ou-se
Member Author

m-ou-se commented Aug 24, 2023

Let's see what happens if I add a few #[inline(always)]...

@bors try @rust-timer queue

@bors
Contributor

bors commented Aug 24, 2023

⌛ Trying commit d1922d5ad206618355b837e688e857dd38e5dffc with merge 0208b67f46c3dc86f3df10c3e01fb70047717ff6...

@bors
Contributor

bors commented Aug 30, 2023

☀️ Try build successful - checks-actions
Build commit: 3373eb1ec6feda72af1b09f19aba78a9e3a88f39 (3373eb1ec6feda72af1b09f19aba78a9e3a88f39)


@rust-timer
Collaborator

Finished benchmarking commit (3373eb1ec6feda72af1b09f19aba78a9e3a88f39): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 0.8% | [0.3%, 4.6%] | 14 |
| Regressions ❌ (secondary) | 0.5% | [0.2%, 0.8%] | 6 |
| Improvements ✅ (primary) | -0.7% | [-1.1%, -0.3%] | 16 |
| Improvements ✅ (secondary) | -0.3% | [-0.3%, -0.3%] | 7 |
| All ❌✅ (primary) | -0.0% | [-1.1%, 4.6%] | 30 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 2.9% | [0.1%, 7.6%] | 8 |
| Regressions ❌ (secondary) | 1.7% | [0.8%, 2.4%] | 5 |
| Improvements ✅ (primary) | -2.9% | [-7.3%, -0.7%] | 10 |
| Improvements ✅ (secondary) | -3.7% | [-6.2%, -1.7%] | 5 |
| All ❌✅ (primary) | -0.4% | [-7.3%, 7.6%] | 18 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 4.1% | [4.1%, 4.1%] | 1 |
| Regressions ❌ (secondary) | 2.6% | [2.2%, 2.9%] | 2 |
| Improvements ✅ (primary) | -2.6% | [-4.9%, -1.2%] | 6 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -1.7% | [-4.9%, 4.1%] | 7 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 0.2% | [0.0%, 1.1%] | 60 |
| Regressions ❌ (secondary) | 0.3% | [0.1%, 0.7%] | 10 |
| Improvements ✅ (primary) | -0.2% | [-0.6%, -0.0%] | 28 |
| Improvements ✅ (secondary) | -0.3% | [-0.4%, -0.1%] | 7 |
| All ❌✅ (primary) | 0.1% | [-0.6%, 1.1%] | 88 |

Bootstrap: 630.718s -> 629.174s (-0.24%)
Artifact size: 316.56 MiB -> 316.53 MiB (-0.01%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Aug 30, 2023
@m-ou-se
Member Author

m-ou-se commented Aug 30, 2023

Well that's quite a mixed bag of perf results. (: Also a few runtime benchmarks degraded quite a bit. If anyone feels like helping out: I could use some help figuring out what is causing these results. ^^

@Kobzol
Contributor

Kobzol commented Sep 14, 2023

GitHub is now somehow super eager to delete old try commits. I'll rerun perf so that I can get a baseline commit SHA to compare against.

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Sep 14, 2023
@Kobzol
Contributor

Kobzol commented Sep 15, 2023

@bors try

@bors
Contributor

bors commented Sep 15, 2023

⌛ Trying commit 453eba0 with merge cf92f86...

bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 15, 2023
New fmt::Arguments representation.

@bors
Contributor

bors commented Sep 15, 2023

☀️ Try build successful - checks-actions
Build commit: cf92f86 (cf92f86bcc15ebf3a31a9467a890c989b7665512)


@rust-timer
Collaborator

Finished benchmarking commit (cf92f86): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 1.8% | [0.3%, 4.5%] | 3 |
| Regressions ❌ (secondary) | 0.5% | [0.5%, 0.5%] | 1 |
| Improvements ✅ (primary) | -0.8% | [-1.6%, -0.4%] | 24 |
| Improvements ✅ (secondary) | -1.1% | [-1.6%, -0.6%] | 11 |
| All ❌✅ (primary) | -0.5% | [-1.6%, 4.5%] | 27 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 3.9% | [0.2%, 5.8%] | 5 |
| Regressions ❌ (secondary) | 2.1% | [2.1%, 2.1%] | 1 |
| Improvements ✅ (primary) | -2.6% | [-5.6%, -0.6%] | 6 |
| Improvements ✅ (secondary) | -3.1% | [-4.7%, -2.1%] | 4 |
| All ❌✅ (primary) | 0.4% | [-5.6%, 5.8%] | 11 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 3.6% | [3.6%, 3.6%] | 1 |
| Regressions ❌ (secondary) | 3.1% | [2.9%, 3.5%] | 4 |
| Improvements ✅ (primary) | -2.3% | [-4.9%, -1.1%] | 8 |
| Improvements ✅ (secondary) | -1.3% | [-1.3%, -1.3%] | 1 |
| All ❌✅ (primary) | -1.7% | [-4.9%, 3.6%] | 9 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 0.5% | [0.0%, 1.3%] | 67 |
| Regressions ❌ (secondary) | 0.4% | [0.2%, 0.7%] | 10 |
| Improvements ✅ (primary) | -0.2% | [-0.8%, -0.0%] | 26 |
| Improvements ✅ (secondary) | -0.3% | [-1.3%, -0.1%] | 16 |
| All ❌✅ (primary) | 0.3% | [-0.8%, 1.3%] | 93 |

Bootstrap: 633.484s -> 629.614s (-0.61%)
Artifact size: 318.18 MiB -> 317.59 MiB (-0.19%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Sep 15, 2023
@Kobzol
Contributor

Kobzol commented Sep 17, 2023

I analyzed the fmt runtime benchmark using the following commands:

# baseline commit
$ rustup-toolchain-install-master e7f9f48d767c1e17e125d83e3102d568f87b868d
# PR commit
$ rustup-toolchain-install-master cf92f86bcc15ebf3a31a9467a890c989b7665512

# analyse
$ RUST_LOG=debug cargo run --bin collector codegen_diff [asm/llvm/mir] fmt `rustup +e7f9f48d767c1e17e125d83e3102d568f87b868d which rustc` `rustup +cf92f86bcc15ebf3a31a9467a890c989b7665512 which rustc` > diff.txt

I haven't really found any interesting differences in the ASM or LLVM IR between the two commits. The only thing I noticed is that the MIR got bigger for one function, probably because of different MIR inlining (?).

MIR diff
Function size stats:
fn main::{closure#0}(_1: &mut [closure@src/main.rs:17:14: 17:17], _2: u64) -> Person: 5848 -> 9140 (+56.29%)

Diff for fn main::{closure#0}(_1: &mut [closure@src/main.rs:17:14: 17:17], _2: u64) -> Person
-----------------------------
 fn main::{closure#0}(_1: &mut [closure@src/main.rs:17:14: 17:17], _2: u64) -> Person {
     debug i => _2;
     let mut _0: Person;
     let mut _3: std::string::String;
     let _4: std::string::String;
     let mut _5: std::fmt::Arguments<'_>;
     let mut _6: &[&str];
     let mut _7: &[core::fmt::rt::Argument<'_>];
     let mut _8: &[core::fmt::rt::Argument<'_>; 1];
     let _9: [core::fmt::rt::Argument<'_>; 1];
     let mut _10: core::fmt::rt::Argument<'_>;
     let mut _11: &u64;
     let mut _12: std::string::String;
     let _13: std::string::String;
     let mut _14: std::fmt::Arguments<'_>;
     let mut _15: &[&str];
     let mut _16: &[core::fmt::rt::Argument<'_>];
     let mut _17: &[core::fmt::rt::Argument<'_>; 1];
     let _18: [core::fmt::rt::Argument<'_>; 1];
     let mut _19: core::fmt::rt::Argument<'_>;
     let mut _20: &u64;
     let mut _21: u8;
     let mut _22: u64;
     let mut _23: u64;
     let mut _24: &[&str; 1];
     let mut _25: &[&str; 1];
     scope 1 {
         debug res => _4;
     }
     scope 2 {
         debug res => _13;
     }
     scope 3 (inlined core::fmt::rt::Argument::<'_>::new_display::<u64>) {
         debug x => _11;
         let mut _26: for<'a, 'b, 'c> fn(&'a u64, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error>;
         scope 4 (inlined core::fmt::rt::Argument::<'_>::new::<u64>) {
             debug x => _11;
             debug f => _26;
             let mut _27: for<'a, 'b, 'c> fn(&'a core::fmt::rt::Opaque, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error>;
             let mut _28: &core::fmt::rt::Opaque;
             scope 5 {
             }
         }
     }
-    scope 6 (inlined core::fmt::rt::Argument::<'_>::new_display::<u64>) {
+    scope 6 (inlined format) {
+        debug args => _5;
+        let mut _29: std::option::Option<&str>;
+        let mut _30: &std::fmt::Arguments<'_>;
+        let mut _31: [closure@std::fmt::format::{closure#0}];
+        let mut _32: &std::fmt::Arguments<'_>;
+        scope 7 (inlined Arguments::<'_>::as_str) {
+            debug self => _30;
+            let mut _33: usize;
+            let mut _34: std::num::NonZeroUsize;
+            let mut _35: &str;
+            scope 8 {
+            }
+            scope 9 (inlined NonZeroUsize::get) {
+                debug self => _34;
+            }
+        }
+    }
+    scope 10 (inlined core::fmt::rt::Argument::<'_>::new_display::<u64>) {
         debug x => _20;
-        let mut _29: for<'a, 'b, 'c> fn(&'a u64, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error>;
-        scope 7 (inlined core::fmt::rt::Argument::<'_>::new::<u64>) {
+        let mut _36: for<'a, 'b, 'c> fn(&'a u64, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error>;
+        scope 11 (inlined core::fmt::rt::Argument::<'_>::new::<u64>) {
             debug x => _20;
-            debug f => _29;
-            let mut _30: for<'a, 'b, 'c> fn(&'a core::fmt::rt::Opaque, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error>;
-            let mut _31: &core::fmt::rt::Opaque;
-            scope 8 {
+            debug f => _36;
+            let mut _37: for<'a, 'b, 'c> fn(&'a core::fmt::rt::Opaque, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error>;
+            let mut _38: &core::fmt::rt::Opaque;
+            scope 12 {
+            }
+        }
+    }
+    scope 13 (inlined format) {
+        debug args => _14;
+        let mut _39: std::option::Option<&str>;
+        let mut _40: &std::fmt::Arguments<'_>;
+        let mut _41: [closure@std::fmt::format::{closure#0}];
+        let mut _42: &std::fmt::Arguments<'_>;
+        scope 14 (inlined Arguments::<'_>::as_str) {
+            debug self => _40;
+            let mut _43: usize;
+            let mut _44: std::num::NonZeroUsize;
+            let mut _45: &str;
+            scope 15 {
+            }
+            scope 16 (inlined NonZeroUsize::get) {
+                debug self => _44;
             }
         }
     }
 
     bb0: {
         StorageLive(_3);
         StorageLive(_4);
         StorageLive(_5);
         StorageLive(_6);
         _25 = const _;
         _6 = _25 as &[&str] (PointerCoercion(Unsize));
         StorageLive(_7);
         StorageLive(_8);
         StorageLive(_9);
         StorageLive(_10);
         StorageLive(_11);
         _11 = &_2;
         StorageLive(_26);
         _26 = <u64 as std::fmt::Display>::fmt as for<'a, 'b, 'c> fn(&'a u64, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error> (PointerCoercion(ReifyFnPointer));
         StorageLive(_27);
         _27 = _26 as for<'a, 'b, 'c> fn(&'a core::fmt::rt::Opaque, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error> (Transmute);
         StorageLive(_28);
         _28 = _11 as &core::fmt::rt::Opaque (Transmute);
         _10 = core::fmt::rt::Argument::<'_> { value: move _28, formatter: move _27 };
         StorageDead(_28);
         StorageDead(_27);
         StorageDead(_26);
         StorageDead(_11);
         _9 = [move _10];
         StorageDead(_10);
         _8 = &_9;
         _7 = move _8 as &[core::fmt::rt::Argument<'_>] (PointerCoercion(Unsize));
         StorageDead(_8);
         _5 = Arguments::<'_>::new_v1(move _6, move _7) -> [return: bb1, unwind continue];
     }
 
     bb1: {
         StorageDead(_7);
         StorageDead(_6);
-        _4 = format(move _5) -> [return: bb2, unwind continue];
+        StorageLive(_29);
+        StorageLive(_30);
+        StorageLive(_33);
+        StorageLive(_34);
+        _34 = (_5.0: std::num::NonZeroUsize);
+        _33 = (_34.0: usize);
+        StorageDead(_34);
+        switchInt(move _33) -> [1: bb7, otherwise: bb8];
     }
 
     bb2: {
+        StorageDead(_16);
+        StorageDead(_15);
+        StorageLive(_39);
+        StorageLive(_40);
+        StorageLive(_43);
+        StorageLive(_44);
+        _44 = (_14.0: std::num::NonZeroUsize);
+        _43 = (_44.0: usize);
+        StorageDead(_44);
+        switchInt(move _43) -> [1: bb11, otherwise: bb12];
+    }
+
+    bb3 (cleanup): {
+        drop(_3) -> [return: bb4, unwind terminate(cleanup)];
+    }
+
+    bb4 (cleanup): {
+        resume;
+    }
+
+    bb5: {
+        StorageDead(_30);
+        StorageLive(_31);
+        StorageLive(_32);
+        _32 = &_5;
+        _31 = [closure@/rustc/cf92f86bcc15ebf3a31a9467a890c989b7665512/library/alloc/src/fmt.rs:614:31: 614:33] { 0: move _32 };
+        StorageDead(_32);
+        _4 = Option::<&str>::map_or_else::<String, [closure@format::{closure#0}], for<'a> fn(&'a str) -> <str as ToOwned>::Owned {<str as ToOwned>::to_owned}>(move _29, move _31, <str as ToOwned>::to_owned) -> [return: bb6, unwind continue];
+    }
+
+    bb6: {
+        StorageDead(_31);
+        StorageDead(_29);
         StorageDead(_5);
         StorageDead(_9);
         _3 = move _4;
         StorageDead(_4);
         StorageLive(_12);
         StorageLive(_13);
         StorageLive(_14);
         StorageLive(_15);
         _24 = const _;
         _15 = _24 as &[&str] (PointerCoercion(Unsize));
         StorageLive(_16);
         StorageLive(_17);
         StorageLive(_18);
         StorageLive(_19);
         StorageLive(_20);
         _20 = &_2;
-        StorageLive(_29);
-        _29 = <u64 as std::fmt::Display>::fmt as for<'a, 'b, 'c> fn(&'a u64, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error> (PointerCoercion(ReifyFnPointer));
-        StorageLive(_30);
-        _30 = _29 as for<'a, 'b, 'c> fn(&'a core::fmt::rt::Opaque, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error> (Transmute);
-        StorageLive(_31);
-        _31 = _20 as &core::fmt::rt::Opaque (Transmute);
-        _19 = core::fmt::rt::Argument::<'_> { value: move _31, formatter: move _30 };
-        StorageDead(_31);
-        StorageDead(_30);
-        StorageDead(_29);
+        StorageLive(_36);
+        _36 = <u64 as std::fmt::Display>::fmt as for<'a, 'b, 'c> fn(&'a u64, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error> (PointerCoercion(ReifyFnPointer));
+        StorageLive(_37);
+        _37 = _36 as for<'a, 'b, 'c> fn(&'a core::fmt::rt::Opaque, &'b mut std::fmt::Formatter<'c>) -> std::result::Result<(), std::fmt::Error> (Transmute);
+        StorageLive(_38);
+        _38 = _20 as &core::fmt::rt::Opaque (Transmute);
+        _19 = core::fmt::rt::Argument::<'_> { value: move _38, formatter: move _37 };
+        StorageDead(_38);
+        StorageDead(_37);
+        StorageDead(_36);
         StorageDead(_20);
         _18 = [move _19];
         StorageDead(_19);
         _17 = &_18;
         _16 = move _17 as &[core::fmt::rt::Argument<'_>] (PointerCoercion(Unsize));
         StorageDead(_17);
-        _14 = Arguments::<'_>::new_v1(move _15, move _16) -> [return: bb3, unwind: bb5];
+        _14 = Arguments::<'_>::new_v1(move _15, move _16) -> [return: bb2, unwind: bb3];
+    }
+
+    bb7: {
+        StorageDead(_33);
+        _35 = deref_copy ((_5.1: core::fmt::Parts).0: &str);
+        _29 = Option::<&str>::Some(_35);
+        goto -> bb5;
     }
 
-    bb3: {
-        StorageDead(_16);
-        StorageDead(_15);
-        _13 = format(move _14) -> [return: bb4, unwind: bb5];
+    bb8: {
+        StorageDead(_33);
+        _29 = Option::<&str>::None;
+        goto -> bb5;
     }
 
-    bb4: {
+    bb9: {
+        StorageDead(_40);
+        StorageLive(_41);
+        StorageLive(_42);
+        _42 = &_14;
+        _41 = [closure@/rustc/cf92f86bcc15ebf3a31a9467a890c989b7665512/library/alloc/src/fmt.rs:614:31: 614:33] { 0: move _42 };
+        StorageDead(_42);
+        _13 = Option::<&str>::map_or_else::<String, [closure@format::{closure#0}], for<'a> fn(&'a str) -> <str as ToOwned>::Owned {<str as ToOwned>::to_owned}>(move _39, move _41, <str as ToOwned>::to_owned) -> [return: bb10, unwind: bb3];
+    }
+
+    bb10: {
+        StorageDead(_41);
+        StorageDead(_39);
         StorageDead(_14);
         StorageDead(_18);
         _12 = move _13;
         StorageDead(_13);
         StorageLive(_21);
         StorageLive(_22);
         StorageLive(_23);
         _23 = _2;
         _22 = Rem(move _23, const 100_u64);
         StorageDead(_23);
         _21 = move _22 as u8 (IntToInt);
         StorageDead(_22);
         _0 = Person { first_name: move _3, last_name: move _12, age: move _21 };
         StorageDead(_21);
         StorageDead(_12);
         StorageDead(_3);
         return;
     }
 
-    bb5 (cleanup): {
-        drop(_3) -> [return: bb6, unwind terminate(cleanup)];
+    bb11: {
+        StorageDead(_43);
+        _45 = deref_copy ((_14.1: core::fmt::Parts).0: &str);
+        _39 = Option::<&str>::Some(_45);
+        goto -> bb9;
     }
 
-    bb6 (cleanup): {
-        resume;
+    bb12: {
+        StorageDead(_43);
+        _39 = Option::<&str>::None;
+        goto -> bb9;
     }
 }
-----------------------------

///
/// The placeholders pointer can be null for default placeholders:
/// a placeholder for each argument once, in order, with default formatting options.
strings_and_placeholders: (*const &'static str, *const rt::Placeholder),
}
Contributor

Why not use an enum? A variant for the num_parts == 1 case, and the other for the general case.
That's how the type is used AFAIU.
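A minimal sketch of that suggestion (all names invented here; this is not the PR's actual code): the single-string case becomes a dedicated variant, and `as_str` falls out of a plain match instead of a `num_parts == 1` check.

```rust
use std::num::NonZeroUsize;

// Stand-ins for rt::Placeholder and rt::Argument.
struct Placeholder;
struct Argument;

enum ArgumentsRepr {
    /// Exactly one string piece and no placeholders.
    Str(&'static str),
    /// The general case: thin pointers plus a combined part counter.
    General {
        num_parts: NonZeroUsize,
        strings: *const &'static str,
        placeholders: *const Placeholder,
        args: *const Argument,
    },
}

impl ArgumentsRepr {
    // as_str becomes a match on the variant rather than a count check.
    fn as_str(&self) -> Option<&'static str> {
        match self {
            ArgumentsRepr::Str(s) => Some(*s),
            ArgumentsRepr::General { .. } => None,
        }
    }
}

fn main() {
    assert_eq!(ArgumentsRepr::Str("hello").as_str(), Some("hello"));
    println!("ok");
}
```

One trade-off to weigh: the compiler may be able to store the enum tag in the `NonZeroUsize` niche, but that layout is not guaranteed, whereas encoding the case in `num_parts` by hand pins the representation down exactly.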

pub fn new_v1(pieces: &'a [&'static str], args: &'a [rt::Argument<'a>]) -> Arguments<'a> {
    if pieces.len() < args.len() || pieces.len() > args.len() + 1 {
        panic!("invalid args");
pub fn new_v1(strings: &'a [&'static str], args: &'a [rt::Argument<'a>]) -> Self {
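The bounds check quoted above encodes the invariant that makes dropping the stored slice lengths sound. Stated standalone (with a hypothetical helper name): the number of literal string pieces must equal the number of arguments, or exceed it by exactly one, which format_args!() always guarantees.

```rust
// Hypothetical helper expressing new_v1's invariant on its own.
fn pieces_args_valid(pieces: usize, args: usize) -> bool {
    pieces == args || pieces == args + 1
}

fn main() {
    assert!(pieces_args_valid(2, 1));  // e.g. "a{}b": pieces ["a", "b"], one argument
    assert!(pieces_args_valid(1, 1));  // e.g. "a{}": pieces ["a"], one argument
    assert!(!pieces_args_valid(3, 1)); // rejected by the check above
    assert!(!pieces_args_valid(1, 3)); // rejected by the check above
    println!("invariant holds");
}
```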
Contributor

From the posted MIR snippet, would it be worth making new_v1 easier for the compiler to optimize? For instance, by checking more invariants in rustc?

Member Author

by checking more invariants in rustc

@cjgillot What do you mean?

@dead-claudia

dead-claudia commented Oct 10, 2023

Edit: bring up an optimization opportunity.

@m-ou-se

Have you considered doing a hybrid of structured arrays and bytecode, like this? Consider it an alternative to this PR, as it's mutually exclusive with some of the functions proposed here.

  • Bytecode string is byte-aligned

  • Count type:

    • 0 = implied/not specified
    • 1 = immediate, expect value
    • 2 = parameter, expect index
  • Part segment:

    • Metadata byte
      • Bit 0: next length is 32-bit
      • Bit 1: has non-default fill char and/or flags
      • Bit 2-3: precision count type
      • Bit 4-5: width count type
      • Bit 6-7: alignment
    • Next length u8/u32
    • 32-bit field for fill char (Unicode values are 21-bit) + flags, if non-default
      • Bit 21: SignPlus
      • Bit 22: SignMinus
      • Bit 23: Alternate
      • Bit 24: SignAwareZeroPad
      • Bit 25: DebugLowerHex
      • Bit 26: DebugUpperHex
      • Bit 30: precision is 32-bit
      • Bit 31: width is 32-bit
    • Precision u8/u32, if precision count type is not "implied"
    • Width u8/u32, if width count type is not "implied"
    • Next data bytes
  • Arguments static data:

    • Static metadata byte:
      • Bit 0: part length is 32-bit
      • Bit 1: start length is 32-bit
    • Part count u8/u32 (before start count to simplify certain tests)
    • Start length u8/u32
    • Start data bytes
    • Part segment sequence
  • Arguments consists of two pointers: bytecode template and arguments start (may be null if no arguments needed). Arguments end is implicitly read from the template.
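The part-segment metadata byte proposed above can be sketched as a pack/unpack pair. This is purely illustrative (struct and function names are invented); the bit layout follows the list in the comment: bit 0 for the 32-bit length flag, bit 1 for fill/flags presence, bits 2-3 for the precision count type, bits 4-5 for the width count type, and bits 6-7 for alignment.

```rust
#[derive(Debug, PartialEq)]
struct Meta {
    len_is_32bit: bool,      // bit 0
    has_fill_or_flags: bool, // bit 1
    precision_kind: u8,      // bits 2-3: 0 implied, 1 immediate, 2 parameter
    width_kind: u8,          // bits 4-5: same encoding
    alignment: u8,           // bits 6-7
}

fn encode(m: &Meta) -> u8 {
    (m.len_is_32bit as u8)
        | (m.has_fill_or_flags as u8) << 1
        | (m.precision_kind & 0b11) << 2
        | (m.width_kind & 0b11) << 4
        | (m.alignment & 0b11) << 6
}

fn decode(b: u8) -> Meta {
    Meta {
        len_is_32bit: b & 1 != 0,
        has_fill_or_flags: b & 2 != 0,
        precision_kind: (b >> 2) & 0b11,
        width_kind: (b >> 4) & 0b11,
        alignment: (b >> 6) & 0b11,
    }
}

fn main() {
    let m = Meta {
        len_is_32bit: false,
        has_fill_or_flags: true,
        precision_kind: 1, // immediate precision
        width_kind: 0,     // implied width
        alignment: 2,
    };
    assert_eq!(decode(encode(&m)), m);
    println!("round-trip ok");
}
```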

Multiple things are being suggested here:

  1. The static array is a thin pointer to a dynamically sized byte-aligned static array that holds everything. This would be iterated as a byte sequence, using unaligned reads for everything.

  2. Most flags are stuffed into a couple fields, to save a bunch of space. Notably, the fill character's spare space (Unicode code points only take up 21 bits) is also used for flags.

  3. Default-option placeholders are optimized down to a single 0 byte, which is only slightly larger than complete omission. This avoids the catastrophic placeholder growth issue and makes the bytecode size adaptive.

  4. Arguments would be stored in actual use order and iterated in one pass, without special indexing. This greatly simplifies iteration (it's just a pointer add) while adding only slight overhead for multiply-used arguments (which would need to be saved to multiple stack slots instead of just one).

  5. Lengths are variably encoded with a very simple prefix bit (who here is creating 2^32-byte segments?), cutting size down a lot for the most common case of very small segments.

  6. Lengths are stored inline in the template, relying on cache locality to stay relatively fast.

Emitting is considerably more complicated, but it not only trims the runtime size to two pointers plus the stack array of arguments, it also shrinks the static data so much that it's about the size of a C format string (but much simpler to process).

Here's how it compares with purely static strings:

  • "a{}b{}c" would require 10 bytes static on all platforms.

    • Currently, it's 24 bytes on 32-bit platforms, 48 bytes on 64-bit platforms, assuming pieces and args are all stored statically.
    • The C format string equivalent "a%sb%sc" requires 8 bytes.
  • "a{}b{:.5}c" would require 11 bytes static on all platforms.

    • Currently, it's 88 bytes on 32-bit platforms, 160 bytes on 64-bit platforms, assuming pieces and args are all stored statically.
    • The C format string equivalent "a%sb%.5fc" requires 10 bytes.
  • For both, 16 bytes of stack would be used on 32-bit platforms and 32 bytes on 64-bit platforms for the arguments struct.

    • This is the same as it currently is, ignoring argument passing overhead.
    • C variadic arguments require similar stack overhead, but indirectly; they can't truly escape it.
  • For both, there would only be two pointers total: placeholder and arguments. No other indirection would exist, and no extra length argument needed (as that's encoded in the source string).

This would help in three areas:

  • Code generation for the function call: it's just write args to stack, load template pointer, call format function. Very simple.
  • The fmt::write function implementation itself is so simple, it could be feasibly written in assembly.
  • The format string is very compact, reducing binary size quite a bit.
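
The second point can be made concrete with a hedged sketch of the interpreter loop, here for the no-options case only: copy the start segment, then alternate "format next argument, copy next segment" until the part count runs out. The byte layout (u8 counts, default formatting for every placeholder) and all names are simplifications for illustration, not the actual proposal:

```rust
use std::fmt::{self, Write};

// Advance the template cursor past n bytes and return them.
fn take<'a>(t: &mut &'a [u8], n: usize) -> &'a [u8] {
    let (head, rest) = t.split_at(n);
    *t = rest;
    head
}

// Template layout assumed here: [part count][start len][start bytes],
// then per part: [segment len][segment bytes].
fn write_template(
    out: &mut impl Write,
    template: &[u8],
    args: &[&dyn fmt::Display],
) -> fmt::Result {
    let mut t = template;
    let parts = take(&mut t, 1)[0] as usize;
    let n = take(&mut t, 1)[0] as usize;
    out.write_str(std::str::from_utf8(take(&mut t, n)).unwrap())?;
    for arg in &args[..parts] {
        write!(out, "{arg}")?; // default options: just delegate to Display
        let n = take(&mut t, 1)[0] as usize;
        out.write_str(std::str::from_utf8(take(&mut t, n)).unwrap())?;
    }
    Ok(())
}
```

The full version would additionally decode the options metadata per part, but the control flow stays this flat.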

As for why I'm not also inlining function references, those generally lack stable runtime positions anyways, and so they can't be fully static. One could lazily resolve it, but it's not worth it for such a small number of local functions.

Simple Display/etc. delegation (ex: #104525) could be done by just making the template optional and, if missing, invoking the method with the object directly. This handles that case completely generically. And for simple non-option formatting, the "simple" variant could be used exclusively. You could also pass a null pointer for the <String as Display>::fmt function instead to bypass it and avoid needing to call a method at all.
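
A minimal sketch of that optional-template delegation, with hypothetical type and variant names (the full template-driven variant is elided; only the two "template missing" fast paths are shown):

```rust
use std::fmt::{self, Display, Write};

// Hypothetical compact representation: if there's no template, the value
// is either a plain literal or a single argument to delegate to.
enum CompactArguments<'a> {
    // Plain string literal: no template, no arguments.
    Literal(&'a str),
    // A bare `{}` with one argument and no options: skip the template and
    // invoke the value's Display impl directly.
    Delegate(&'a dyn Display),
}

fn write_compact(out: &mut impl Write, args: &CompactArguments<'_>) -> fmt::Result {
    match args {
        CompactArguments::Literal(s) => out.write_str(s),
        CompactArguments::Delegate(d) => write!(out, "{d}"),
    }
}
```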

And yes, this function could, in theory, be made polymorphic over UTF-8 strings and byte strings with few changes. Only real difference is character filling.


Data can be made all 32-bit and unconditionally present if the above variable-length encoding isn't okay.

  • Part segment:

    • Metadata byte
      • Bit 2-3: precision count type
      • Bit 4-5: width count type
      • Bit 6-7: alignment
    • Next length u32
    • 32-bit field for fill char (Unicode values are 21-bit) + flags
      • Bit 21: SignPlus
      • Bit 22: SignMinus
      • Bit 23: Alternate
      • Bit 24: SignAwareZeroPad
      • Bit 25: DebugLowerHex
      • Bit 26: DebugUpperHex
    • Precision u32
    • Width u32
    • Next data bytes
  • Arguments static data:

    • Part count u32
    • Start length u32
    • Start data bytes
    • Part segment sequence
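
A hypothetical decoder for that all-32-bit part segment, assuming little-endian fields; the struct and field names are illustrative only, following the layout above (metadata byte, next length, fill + flags word, precision, width, next data bytes):

```rust
use std::convert::TryInto;

struct PartSegment<'a> {
    precision_kind: u8, // metadata bits 2-3
    width_kind: u8,     // metadata bits 4-5
    alignment: u8,      // metadata bits 6-7
    fill: char,         // low 21 bits of the fill/flags word
    flags: u16,         // bits 21+ of the fill/flags word (SignPlus, ...)
    precision: u32,
    width: u32,
    next: &'a [u8],     // literal bytes following this placeholder
}

fn u32_at(t: &[u8], i: usize) -> u32 {
    u32::from_le_bytes(t[i..i + 4].try_into().unwrap())
}

// Decode one segment; returns it plus the remaining template bytes.
fn decode_part(t: &[u8]) -> (PartSegment<'_>, &[u8]) {
    let meta = t[0];
    let next_len = u32_at(t, 1) as usize;
    let fill_flags = u32_at(t, 5);
    let (next, rest) = t[17..].split_at(next_len);
    (
        PartSegment {
            precision_kind: (meta >> 2) & 0b11,
            width_kind: (meta >> 4) & 0b11,
            alignment: meta >> 6,
            fill: char::from_u32(fill_flags & 0x1f_ffff).unwrap_or(' '),
            flags: (fill_flags >> 21) as u16,
            precision: u32_at(t, 9),
            width: u32_at(t, 13),
            next,
        },
        rest,
    )
}
```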

This increases static size overhead to the following:

  • "a{}b{}c" would require 45 bytes static on all platforms, compared to 24 on 32-bit and 48 on 64-bit currently.
  • "a{}b{:.5}c" would require 45 bytes static on all platforms, compared to 88 on 32-bit and 160 on 64-bit currently.

And of course, there is a middle ground, where whole fields are only conditionally present, but lengths are always 32-bit:

  • Part segment:

    • Metadata byte
      • Bit 0: has non-default fill char/custom args
      • Bit 2-3: precision count type
      • Bit 4-5: width count type
      • Bit 6-7: alignment
    • Next length u32
    • 32-bit field for fill char (Unicode values are 21-bit) + flags, if non-default
      • Bit 21: SignPlus
      • Bit 22: SignMinus
      • Bit 23: Alternate
      • Bit 24: SignAwareZeroPad
      • Bit 25: DebugLowerHex
      • Bit 26: DebugUpperHex
    • Precision u32, if precision count type is not "implied"
    • Width u32, if width count type is not "implied"
    • Next data bytes
  • Arguments static data:

    • Part count u32
    • Start length u32
    • Start data bytes
    • Part segment sequence

This results in the following static sizes:

  • "a{}b{}c" would require 21 bytes static on all platforms, compared to 24 on 32-bit and 48 on 64-bit currently.
  • "a{}b{:.5}c" would require 25 bytes static on all platforms, compared to 88 on 32-bit and 160 on 64-bit currently.
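
These byte counts can be checked mechanically; this sketch just tallies the conditional layout above (8-byte header, then per part a metadata byte and a 4-byte next length, plus 4 bytes each for fill/flags, precision, and width only when present, plus the literal bytes):

```rust
#[derive(Clone, Copy)]
struct Part {
    has_fill_or_flags: bool,
    has_precision: bool,
    has_width: bool,
    next_literal_len: usize,
}

fn static_size(start_literal_len: usize, parts: &[Part]) -> usize {
    // Header: part count u32 + start length u32.
    let mut size = 4 + 4 + start_literal_len;
    for p in parts {
        // Always present: metadata byte + next length u32.
        size += 1 + 4;
        // Conditionally present 32-bit fields.
        size += 4 * usize::from(p.has_fill_or_flags);
        size += 4 * usize::from(p.has_precision);
        size += 4 * usize::from(p.has_width);
        size += p.next_literal_len;
    }
    size
}
```

Plugging in "a{}b{}c" (two default parts) and "a{}b{:.5}c" (second part gains a precision field) reproduces the 21- and 25-byte figures.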

@tgross35
Contributor

tgross35 commented Oct 31, 2023

@dead-claudia I think most of the brainstorming is at #99012, this PR is for a specific experiment. That being said, the ideas sound great!

One thing is that I don't think we want to use invalid UTF-8 characters to store padding information, see #99012 (comment). Also it seems like it may be a good idea to reserve some bits for future fill options.

I don't think there would be any opposition to storing indices as u8, the only downside being the unaligned reads that result.

Are you up for trying this? I'd be interested to see an experiment PR similar to this one that tests out your ideas to see how it stacks up for performance

Emit's considerably more complicated, but it not only trims the runtime size to 2 pointers + arguments stack array for arguments, but also shaves the runtime array sizes so much, it's about the size of a C format string (but much simpler to process).

For anyone's reference, here's an older C printf implementation (printf calls vfprintf), rather complex: http://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/vfprintf.c;h=fc370e8cbc4e9652a2ed377b1c6f2324f15b1bf9;hb=3321010338384ecdc6633a8b032bb0ed6aa9b19a#l196. I'm not entirely sure where the size comes from; tough code to read...

@dead-claudia

@tgross35

@dead-claudia I think most of the brainstorming is at #99012, this PR is for a specific experiment. That being said, the ideas sound great!

Yeah, didn't know about that issue. Thanks!

One thing is that I don't think we want to use invalid UTF-8 characters to store padding information, see #99012 (comment). [...]

To be clear, that character + flags field stores a literal Unicode scalar code point, not a UTF-8 character. It can easily accommodate a byte, and technically speaking, that field for bytes could be shrunk down to 8 bits for the fill character, 8 bits for the flags as it stands currently.

[...] Also it seems like it may be a good idea to reserve some bits for future fill options

There's a whole 5 more that could feasibly fit in that character + flags bit set (4 if you want a "continuation" bit). There's room to grow. 🙂

I don't think there would be any opposition to storing indices as u8, the only downside being the unaligned reads that result.

True, but how often is this really being called? Also, while unaligned reads are slower, this one's already very likely in cache, and so in practice there's likely at most 1 memory bus fetch anyways for most platforms. This is also why this person found little difference in practice - most of their accesses were already likely prefetched into cache.

With architectures like 32-bit ARM that at best only half-support unaligned access, the relative performance might be close enough not to matter, considering the frequent dynamic function calls and the like. Also, with the compact encoding, most small accesses (where unaligned access would have the most impact) would be 8-bit loads anyways, for which alignment is moot. And this can be bypassed if no fill is needed.

Also, LLVM only generates two loads for a misaligned load, so the impact should be at worst 2-3x even on those targets. I do need to inspect the code gen, though.
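
For reference, the portable way to express such a possibly-misaligned load in Rust is to go through a byte copy and let LLVM pick the best load sequence for the target (a single unaligned load on x86/AArch64, the multi-load sequence elsewhere):

```rust
use std::convert::TryInto;

// Read a little-endian u32 at an arbitrary (possibly misaligned) offset.
fn read_u32_le(bytes: &[u8], offset: usize) -> u32 {
    let chunk: [u8; 4] = bytes[offset..offset + 4].try_into().unwrap();
    u32::from_le_bytes(chunk)
}
```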

Worst case scenario (if it causes perf issues), this structure's easy to tweak to ensure proper padding - just pad the string lengths to compensate. For one, the existing bits should probably be moved even now so it all exists on the same byte, but I obviously didn't think that far when laying that out initially. 🙃

Are you up for trying this? I'd be interested to see an experiment PR similar to this one that tests out your idea

I'd love to, but it'll be a while (few months minimum) before I'd have any real time to play around and try to implement this.

@tgross35
Contributor

tgross35 commented Nov 1, 2023

To be clear, that character + flags field stores a literal Unicode scalar code point, not a UTF-8 character. It can easily accommodate a byte, and technically speaking, that field for bytes could be shrunk down to 8 bits for the fill character, 8 bits for the flags as it stands currently.
[...]
There's a whole 5 more that could feasibly fit in that character + flags bit set (4 if you want a "continuation" bit). There's room to grow. 🙂

Oh, right, good points. I just read Bit 31: width is 32-bit and assumed the space was full 😄

True, but how often is this really being called? Also, while unaligned reads are slower, this one's already very likely in cache, and so in practice there's likely at most 1 memory bus fetch anyways for most platforms.

Oh, I don't have any perf concerns, the APIs for unaligned things are just messier and can overall be a bit more eh (e.g. #[repr(packed)] is generally disliked for a few reasons). But that may not even materialize here.

Are you up for trying this? I'd be interested to see an experiment PR similar to this one that tests out your idea

I'd love to, but it'll be a while (few months minimum) before I'd have any real time to play around and try to implement this.

Too bad :) maybe somebody will be interested in picking it up as an experiment in the meantime

@tgross35
Contributor

tgross35 commented Dec 6, 2023

@dead-claudia you refer to "start data" a few times. What exactly is this? I assume it stores the argument information, but it's not clear to me how it gets processed (perhaps an array of indices that use dynamic data?)

@dead-claudia

@dead-claudia you refer to "start data" a few times. What exactly is this? I assume it stores the argument information, but it's not clear to me how it gets processed (perhaps an array of indices that use dynamic data?)

Sorry, should've explained it then. It's the byte contents of the first string segment. For instance, it's the bytes of "this was a triumph" in format_args!("this was a triumph{}i'm making a note here{}huge success", foo, bar).
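
As a toy illustration (handling only bare {} placeholders), the start data is whatever precedes the first placeholder, and each part then carries the literal that follows it:

```rust
// Split a format string with only bare `{}` placeholders into the
// "start data" and the per-part trailing literals.
fn split_template(s: &str) -> (&str, Vec<&str>) {
    let mut pieces = s.split("{}");
    let start = pieces.next().unwrap_or("");
    (start, pieces.collect())
}
```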

@Dylan-DPC
Member

Closing this as inactive. Feel free to reopen this PR or create a new PR if you get the time to work on this. Thanks!

@Dylan-DPC Dylan-DPC closed this Apr 2, 2024
@Dylan-DPC Dylan-DPC added S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Apr 2, 2024