Implement a specialized version std::iter::Enumerate for TrustedLen #77822

sdroege · 2020-10-11T15:59:51Z

This allows to hint at the compiler via intrinsics::assume() that the
returned index is smaller than the length of the underlying iterator and
allows the compiler to optimize code better based on that.

Fixes #75935

A benchmark comes in a bit.

rust-highfive · 2020-10-11T15:59:55Z

r? @shepmaster

(rust_highfive has picked a reviewer for you, use r? to override)

sdroege · 2020-10-11T16:24:34Z

Benchmark for the function from the issue with the same implementation as this PR outside of std.

It's around 17.6% faster on this function.

test tests::bench_test_0 ... bench:   3,517,733 ns/iter (+/- 585,186)
test tests::bench_test_1 ... bench:   2,991,140 ns/iter (+/- 239,283)

#![feature(test)]
#![feature(specialization)]
#![feature(trusted_len)]
#![feature(core_intrinsics)]

#![allow(soft_unstable)]

extern crate test;

pub fn test_1(x: &[u32], y: &[u32]) -> u32 {
    let mut sum = 0;
    let chunk_size = y.len();
    for (c, y) in Enumerate::new(y.iter()) {
        for chunk in x.chunks_exact(chunk_size) {
            sum += chunk[c] + y;
        }
    }
    sum
}

pub fn test_0(x: &[u32], y: &[u32]) -> u32 {
    let mut sum = 0;
    let chunk_size = y.len();
    for (c, y) in y.iter().enumerate() {
        for chunk in x.chunks_exact(chunk_size) {
            sum += chunk[c] + y;
        }
    }
    sum
}

struct Enumerate<I> {
    iter: I,
    count: usize,
    len: usize,
}

impl<I: Iterator> Enumerate<I> {
    #[inline]
    fn new(iter: I) -> Self {
        EnumerateImpl::new(iter)
    }
}

impl<I> Iterator for Enumerate<I>
where
    I: Iterator,
{
    type Item = (usize, <I as Iterator>::Item);

    #[inline]
    fn next(&mut self) -> Option<(usize, <I as Iterator>::Item)> {
        EnumerateImpl::next(self)
    }
}

// Enumerate specialization trait
#[doc(hidden)]
trait EnumerateImpl<I> {
    type Item;
    fn new(iter: I) -> Self;
    fn next(&mut self) -> Option<(usize, Self::Item)>;
}

impl<I> EnumerateImpl<I> for Enumerate<I>
where
    I: Iterator,
{
    type Item = I::Item;

    #[inline]
    default fn new(iter: I) -> Self {
        Enumerate {
            iter,
            count: 0,
            len: 0, // unused
        }
    }

    #[inline]
    default fn next(&mut self) -> Option<(usize, I::Item)> {
        let a = self.iter.next()?;
        let i = self.count;
        // Possible undefined overflow.
        self.count += 1;
        Some((i, a))
    }
}

impl<I> EnumerateImpl<I> for Enumerate<I>
where
    I: std::iter::TrustedLen + ExactSizeIterator + Iterator,
{
    #[inline]
    fn new(iter: I) -> Self {
        let len = iter.size_hint().0;

        Enumerate {
            iter,
            count: 0,
            len,
        }
    }

    #[inline]
    fn next(&mut self) -> Option<(usize, I::Item)> {
        let a = self.iter.next()?;
        unsafe { std::intrinsics::assume(self.count < self.len); }
        let i = self.count;
        // Possible undefined overflow.
        self.count += 1;
        Some((i, a))
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn bench_test_0(b: &mut Bencher) {
        let x = vec![0; 4 * 1024 * 1024];
        let y = vec![0; 4 * 1024 * 1024];
        b.iter(|| {
            test::black_box(test_0(test::black_box(&x), test::black_box(&y)));
        });
    }

    #[bench]
    fn bench_test_1(b: &mut Bencher) {
        let x = vec![0; 4 * 1024 * 1024];
        let y = vec![0; 4 * 1024 * 1024];
        b.iter(|| {
            test::black_box(test_1(test::black_box(&x), test::black_box(&y)));
        });
    }
}

library/core/src/iter/adapters/mod.rs

GuillaumeGomez · 2020-10-16T08:09:41Z

cc @rust-lang/libs

GuillaumeGomez · 2020-10-16T08:12:58Z

@pietroalbini Do we have a benchmark tool for such changes?

pietroalbini · 2020-10-16T09:22:36Z

We can see if this affects rustc's speed with perf, but otherwise I'm not aware of other benchmarking tooling.

sdroege · 2020-10-29T08:14:03Z

Is there anything I can do to help moving this forward?

shepmaster · 2020-11-17T02:35:43Z

r? @m-ou-se

bors · 2020-11-23T02:38:58Z

☔ The latest upstream changes (presumably #79319) made this pull request unmergeable. Please resolve the merge conflicts.

Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:

@rustbot modify labels: +S-waiting-on-review -S-waiting-on-author

sdroege · 2020-11-23T10:46:05Z

umbrella The latest upstream changes (presumably #79319) made this pull request unmergeable. Please resolve the merge conflicts.

Should be all good again.

m-ou-se

Thanks for your PR, and sorry for the delay.

This change adds quite a lot of non-trivial code to fix a seemingly small optimization problem. In addition, it makes the Enumerate struct bigger with a field that's often unused, which can be noticable in some situations.

I agree #75935 is a problem that should be fixed, but I'm a bit uncomfortable with a solution/workaround like this.

Do similar problems come up in other languages that use LLVM, and how is it handled there? Does LLVM maybe already have an open issue for a similar situation in e.g. C++ (where pointer pairs are also the common representation of ranges)?

library/core/src/iter/adapters/enumerate.rs

m-ou-se · 2020-12-30T16:30:37Z

library/core/src/iter/adapters/enumerate.rs

+ // SAFETY: the caller must uphold the contract for
+ // `Iterator::__iterator_get_unchecked`.
+ let value = unsafe { try_get_unchecked(&mut self.iter, idx) };
+ let idx = Add::add(self.count, idx);


Why does this use Add::add instead of +? If this is necessary, a comment explaining why would be useful.

That's how it was originally and is everywhere else in this file. I don't know why and was wondering the same, but kept it because there presumably was a reason

The Add::add delays decision about overflow check mode to the crate that does the code generation, even if overflow checks are disabled in the current crate, which they usually are for the standard library.

According to #81721, we should be using #[rustc_inherit_overflow_checks] with a regular + operator here instead of the Add:add trick.

sdroege · 2020-12-31T09:36:46Z

Thanks for your PR, and sorry for the delay.

Thanks for the review and no problem at all!

This change adds quite a lot of non-trivial code to fix a seemingly small optimization problem.

It doesn't actually add new code, it rather duplicates already existing code to insert intrinsics::assume() in strategic places. Arguably worse :)

Would it be more acceptable if I refactored this into a macro that generates both copies of the code and inserts the assume() in one of the two instantiations of the macro?

In addition, it makes the Enumerate struct bigger with a field that's often unused, which can be noticable in some situations.

Indeed, that's a potential problem. I wonder if this can be avoided in some clever way? Associated types don't play well with specialization I assume, so we can't have it map to () in the general case and bool in the specialized case?

However this is the same already for the Zip implementations. The generic one keeps an unused boolean around that is only used by the TrustedRandomAccess-specialized one.

Do similar problems come up in other languages that use LLVM, and how is it handled there? Does LLVM maybe already have an open issue for a similar situation in e.g. C++ (where pointer pairs are also the common representation of ranges)?

I'll try reproducing with with clang++ and iterators, doesn't seem too easy though as C++ doesn't have such a rich API with lots of pre-defined iterators.

m-ou-se · 2020-12-31T10:41:19Z

Associated types don't play well with specialization I assume

Indeed. Not yet, at least. :(

The main reason I'm a bit reluctant to merge this, is that this is a compiler/llvm/optimization problem, so a workaround in the library like this should only be a solution if the root problem isn't going to be fixed any time soon instead. If this was just adding an assume() line or two, it wouldn't be much of a problem. But specializations like this make it significantly harder to maintain this code, and make it much easier to add mistakes in the future. I do realize we already have plenty of specializations like this for performance reasons, but I'd first like to understand if this might already be fixed soon by a compiler (llvm? mir? ..?) patch, before we reach for a workaround like this.

sdroege · 2020-12-31T10:51:06Z

Makes sense!

I'd expect this to be a general problem in how llvm optimizes such code, but it seems like something that is much easier to hit in Rust code than in C++ (no implicit bounds checks by default, iterator chaining not common, ...). If I implement something like the chunks_exact() and enumerate() plus bounds-checking array accessors in C++ I'm 99% sure it would behave the same, but probably no C++ programmer would write such code in practice.

I'll try to come up with a testcase that can be reported to llvm

JohnCSimon · 2021-01-19T05:54:15Z

ping form triage
@sdroege - can you please post the status of this PR? thank you.

sdroege · 2021-01-19T10:34:47Z

@sdroege - can you please post the status of this PR? thank you.

@JohnCSimon No news yet, it's still on my list of things to do but will probably take a while. Doing a C++ testcase for clang++ will require writing quite a bit of C++ boilerplate :)

sdroege · 2021-01-30T08:41:53Z

Ok, I actually tried implementing basically the same as what the iterators would do in plain C without any abstraction noise like iterators on top and it behaves exactly the same. Both with clang and gcc.

See https://godbolt.org/z/TrzW5h .

The assert() in foo1() stays, both assert()s in foo2() are optimized away. Adding a __builtin_assume(c < y_len); at the top of the outer loop's body (which is basically what this PR here does) allows the assertion to get optimized away.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

uint32_t foo1(const uint32_t *x, size_t x_len, const uint32_t *y, size_t y_len) {
  uint32_t sum = 0;
  size_t chunk_size = y_len;

  const uint32_t *y_end = y + y_len;
  size_t c = 0;
  for (const uint32_t *y_iter = y; y_iter != y_end; y_iter++, c++) {
    size_t chunks_len = x_len - (x_len % chunk_size);

    for (const uint32_t *chunks_iter = x; chunks_len >= chunk_size; chunks_iter += chunks_len, chunks_len -= chunk_size) {
      uint32_t val;

      assert(c < chunk_size);
      val = chunks_iter[c];

      sum += val + *y_iter;
    }
  }

  return sum;
}

uint32_t foo2(const uint32_t *x, size_t x_len, const uint32_t *y, size_t y_len) {
  uint32_t sum = 0;
  size_t chunk_size = y_len;

  for (size_t c = 0; c < chunk_size; c++) {
    uint32_t y_iter;
    size_t chunks_len;

    assert(c < y_len);
    y_iter = y[c];
    
    chunks_len  = x_len - (x_len % chunk_size);
    for (const uint32_t *chunks_iter = x; chunks_len >= chunk_size; chunks_iter += chunks_len, chunks_len -= chunk_size) {
      uint32_t val;

      assert(c < chunk_size);
      val = chunks_iter[c];

      sum += val + y_iter;
    }
  }

  return sum;
}

@m-ou-se Do you think this is worth reporting to LLVM in this form? Nobody would write such code in C, but that's basically what the Rust iterators do :)

sdroege · 2021-01-30T10:40:37Z

Well of course that can be simplified to just the following, which is more realistic C code. And as bonus, the same thing with C++ iterators on std::vector.

In theory all the assertions in this code should be optimized away, but none are. I'll report that to LLVM then.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#include <vector>

void foo1(const uint32_t *y, size_t y_len) {
  const uint32_t *y_end = y + y_len;
  size_t c = 0;
  for (const uint32_t *y_iter = y; y_iter != y_end; y_iter++, c++) {
    assert(c < y_len);
  }
  assert(c == y_len);
}

void foo2(const std::vector<uint32_t>& y) {
    size_t c = 0;
    for (auto y_iter: y) {
        assert(c < y.size());
        c++;
    }
    assert(c == y.size());
}

void foo3(const std::vector<uint32_t>& y) {
    size_t c = 0;
    for (auto y_iter = y.cbegin(); y_iter != y.cend(); y_iter++, c++) {
        assert(c < y.size());
    }
    assert(c == y.size());
}

And the minimal version in Rust too

pub fn foo(y: &[u32]) {
    let mut x = 0;
    for (c, _y) in y.iter().enumerate() {
        assert!(c < y.len());
        x = c;
    }
    assert!(x == y.len());
}

sdroege · 2021-01-30T12:56:56Z

Reported at https://bugs.llvm.org/show_bug.cgi?id=48965 . Should we close this one here then, or ...?

m-ou-se

@sdroege Nice! Thanks a lot for investigating and filing that report!

Should we close this one here then, or ...?

We could either wait for a future version of LLVM with that issue fixed to end up in rustc, or we could go ahead and merge this PR for now and remove this specialization again when it's no longer necessary. Merging this now means we don't have to wait, but it does make this part of the code a bit harder to maintain. I'd be fine with that, as long as we make sure to document (in a comment) what the specialization is for and when it can be removed (with a link to that llvm issue), to make sure this doesn't stick around forever. What do you prefer?

library/core/src/iter/adapters/enumerate.rs

sdroege · 2021-01-30T15:48:02Z

What do you prefer?

Well, I'd prefer if it was merged as I expect this to take quite a while to fix in LLVM. If it helps anything, I'd be happy to help with reviewing any PRs in the future in this and related code to keep the maintenance burden lower on your side.

I'll update the PR accordingly now and leave the final decision to you. I'll also add another commit on top of the two open discussions above, they already apply to the old code and would be useful to document in the code.

This allows to hint at the compiler via `intrinsics::assume()` that the returned index is smaller than the length of the underlying iterator and allows the compiler to optimize code better based on that. Fixes rust-lang#75935

… Enumerate implementation

…ems fits into an usize

sdroege · 2021-01-30T16:50:23Z

Updated accordingly (and rebased). No functional changes other than constraining the specialization to TrustedRandomAccess and otherwise just various additional comments.

m-ou-se · 2021-01-30T17:29:37Z

library/core/src/iter/adapters/enumerate.rs

+// This could be bound to `TrustedLen + ExactSizeIterator` or `TrustedRandomAccess` to guarantee
+// that the number of elements fits into an `usize` and that the returned length is actually the
+// real length. `TrustedRandomAccess` was selected because specialization on `ExactSizeIterator` is
+// not possible (yet?).


@matthewjasper Could you comment on the possibilities here? Specializing on ExactSizeIterator would be useful here, but is not accepted. I see #[rustc_specialization_trait] and #[rustc_unsafe_specialization_marker] being used in some places to work around this, but I don't know in which cases those can be used.

Considering that TrustedLen is unsafe + unstable + rustc_unsafe_specialization_marker it is more or less implied that any type that implements it is some iterator owned by the standard library. Which means there are no lifetime shenanigans that would make ExactSizeIterator unsafe for min_specialization.
Using that justification a rustc_unsafe_specialization_marker helper trait inheriting from both should be fine. Probably.

On the other hand TrustedRandomAccess is extending its tendrils into ever more places, so soon it should be almost equivalent in power to TrustedLen + ExactSizeIterator.

TrustedRandomAccess seems like it would always be more specific than TrustedLen+ExactSizeIterator. Implementing the former on things like BTreeMap would probably be not a good idea, while TrustedLen+ExactSizeIterator would be easy.

Then yeah, I think that should be possible. But it might be good to get a second opinion from someone who has better understanding of the min_specialization rules.

Considering the progress that happened in the LLVM bug report and all the issues found recently involving TrustedRandomAccess, maybe let's just wait this out until LLVM handles it correctly?

What do you think @m-ou-se ?

bors · 2021-02-22T07:05:11Z

☔ The latest upstream changes (presumably #81732) made this pull request unmergeable. Please resolve the merge conflicts.

crlf0710 · 2021-03-12T16:59:28Z

Ping from triage: @sdroege Hi, according to the discussion above, i'd like to switch this pr to S-Blocked state, but that needs an actual issue for this pr to block on. Is it possible to create such an issue that describes exactly what issue needs to be solved before this pr become possible to land?

sdroege · 2021-03-12T18:19:59Z

@crlf0710 https://bugs.llvm.org/show_bug.cgi?id=48965 would need to be solved and Rust would have to be updated to an LLVM version including it.

m-ou-se · 2021-03-27T11:34:29Z

maybe let's just wait this out until LLVM handles it correctly?

@sdroege Sure. (Thanks for working on this!)

rust-highfive assigned shepmaster Oct 11, 2020

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 11, 2020

sdroege force-pushed the enumerate-specialization branch from 54a6861 to 49f8381 Compare October 11, 2020 16:27

sdroege commented Oct 11, 2020

View reviewed changes

library/core/src/iter/adapters/mod.rs Outdated Show resolved Hide resolved

sdroege commented Oct 11, 2020

View reviewed changes

library/core/src/iter/adapters/mod.rs Outdated Show resolved Hide resolved

sdroege force-pushed the enumerate-specialization branch 2 times, most recently from 5f1ebd5 to cb9db68 Compare October 11, 2020 16:55

jyn514 added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 11, 2020

rust-highfive assigned m-ou-se and unassigned shepmaster Nov 17, 2020

sdroege mentioned this pull request Nov 20, 2020

Split each iterator adapter and source into individual modules #77697

Merged

sdroege force-pushed the enumerate-specialization branch 2 times, most recently from 27d8cee to fb23034 Compare November 23, 2020 10:45

sdroege force-pushed the enumerate-specialization branch from fb23034 to 14c07d8 Compare November 23, 2020 11:02

m-ou-se reviewed Dec 30, 2020

View reviewed changes

m-ou-se added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 30, 2020

sdroege mentioned this pull request Jan 30, 2021

slice::iter() does not preserve number of iterations information for optimizer causing unneeded bounds checks #75935

Open

m-ou-se reviewed Jan 30, 2021

View reviewed changes

library/core/src/iter/adapters/enumerate.rs Outdated Show resolved Hide resolved

sdroege added 3 commits January 30, 2021 18:39

Add explanation why Add::add() is called directly instead of + in the…

44c5554

… Enumerate implementation

Document that TrustedRandomAccess also requires that the number of it…

49b8f87

…ems fits into an usize

sdroege force-pushed the enumerate-specialization branch from 14c07d8 to 49b8f87 Compare January 30, 2021 16:40

m-ou-se reviewed Jan 30, 2021

View reviewed changes

m-ou-se closed this Mar 27, 2021

sdroege deleted the enumerate-specialization branch May 6, 2021 07:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a specialized version std::iter::Enumerate for TrustedLen #77822

Implement a specialized version std::iter::Enumerate for TrustedLen #77822

sdroege commented Oct 11, 2020 •

edited

Loading

rust-highfive commented Oct 11, 2020

sdroege commented Oct 11, 2020

GuillaumeGomez commented Oct 16, 2020

GuillaumeGomez commented Oct 16, 2020

pietroalbini commented Oct 16, 2020

sdroege commented Oct 29, 2020

shepmaster commented Nov 17, 2020

bors commented Nov 23, 2020

sdroege commented Nov 23, 2020

m-ou-se left a comment

m-ou-se Dec 30, 2020

sdroege Dec 31, 2020

tmiasko Dec 31, 2020

m-ou-se Feb 3, 2021

sdroege commented Dec 31, 2020 •

edited

Loading

m-ou-se commented Dec 31, 2020

sdroege commented Dec 31, 2020

JohnCSimon commented Jan 19, 2021

sdroege commented Jan 19, 2021

sdroege commented Jan 30, 2021

sdroege commented Jan 30, 2021 •

edited

Loading

sdroege commented Jan 30, 2021

m-ou-se left a comment

sdroege commented Jan 30, 2021

sdroege commented Jan 30, 2021

m-ou-se Jan 30, 2021

the8472 Feb 1, 2021

sdroege Feb 2, 2021

the8472 Feb 2, 2021

sdroege Feb 21, 2021

bors commented Feb 22, 2021

crlf0710 commented Mar 12, 2021

sdroege commented Mar 12, 2021

m-ou-se commented Mar 27, 2021 •

edited

Loading

Implement a specialized version std::iter::Enumerate for TrustedLen #77822

Implement a specialized version std::iter::Enumerate for TrustedLen #77822

Conversation

sdroege commented Oct 11, 2020 • edited Loading

rust-highfive commented Oct 11, 2020

sdroege commented Oct 11, 2020

GuillaumeGomez commented Oct 16, 2020

GuillaumeGomez commented Oct 16, 2020

pietroalbini commented Oct 16, 2020

sdroege commented Oct 29, 2020

shepmaster commented Nov 17, 2020

bors commented Nov 23, 2020

sdroege commented Nov 23, 2020

m-ou-se left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sdroege commented Dec 31, 2020 • edited Loading

m-ou-se commented Dec 31, 2020

sdroege commented Dec 31, 2020

JohnCSimon commented Jan 19, 2021

sdroege commented Jan 19, 2021

sdroege commented Jan 30, 2021

sdroege commented Jan 30, 2021 • edited Loading

sdroege commented Jan 30, 2021

m-ou-se left a comment

Choose a reason for hiding this comment

sdroege commented Jan 30, 2021

sdroege commented Jan 30, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bors commented Feb 22, 2021

crlf0710 commented Mar 12, 2021

sdroege commented Mar 12, 2021

m-ou-se commented Mar 27, 2021 • edited Loading

sdroege commented Oct 11, 2020 •

edited

Loading

sdroege commented Dec 31, 2020 •

edited

Loading

sdroege commented Jan 30, 2021 •

edited

Loading

m-ou-se commented Mar 27, 2021 •

edited

Loading