Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loops with inclusive ranges produce horrible assembly #75035

Closed
MSxDOS opened this issue Aug 2, 2020 · 15 comments
Closed

Loops with inclusive ranges produce horrible assembly #75035

MSxDOS opened this issue Aug 2, 2020 · 15 comments
Assignees
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@MSxDOS
Copy link

MSxDOS commented Aug 2, 2020

This:

extern {
    fn f();
}

pub unsafe fn test_loop() {
    for _ in 1 ..= 7 { 
        f();
    }
}

on 1.42:

example::test_loop:
        push    rbx
        mov     rbx, qword ptr [rip + f@GOTPCREL]
        call    rbx
        call    rbx
        call    rbx
        call    rbx
        call    rbx
        call    rbx
        mov     rax, rbx
        pop     rbx
        jmp     rax

on 1.43 and above:

example::test_loop:
        push    rbp
        push    r15
        push    r14
        push    rbx
        push    rax
        mov     ebx, 1
        mov     r14d, 7
        mov     r15, qword ptr [rip + f@GOTPCREL]
.LBB0_1:
        lea     ebp, [rbx + 1]
        cmp     ebx, 7
        cmovge  ebp, r14d
        call    r15
        cmp     ebp, 7
        jg      .LBB0_3
        cmp     ebx, 7
        mov     ebx, ebp
        jl      .LBB0_1
.LBB0_3:
        add     rsp, 8
        pop     rbx
        pop     r14
        pop     r15
        pop     rbp
        ret

https://godbolt.org/z/n63seh

@bugadani
Copy link
Contributor

bugadani commented Aug 2, 2020

Probably caused by #68835

As a workaround, (1..=7).for_each(|_| { f(); }) seems to be optimized.

@leonardo-m
Copy link

See also #75024

@rustbot rustbot added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Aug 2, 2020
@ds84182
Copy link

ds84182 commented Aug 5, 2020

Seems like the following generates good assembly:

extern {
    fn f();
}

struct Inclusive(usize, Option<usize>);

impl Inclusive {
    fn new(start: usize, end: usize) -> Self {
        if start > end {
            Inclusive(start, None)
        } else {
            Inclusive(start, Some(end - start))
        }
    }
}

impl Iterator for Inclusive {
    type Item = usize;
    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        if let Some(count) = self.1 {
            self.1 = count.checked_sub(1);
            let n = self.0;
            if count != 0 {
                self.0 = self.0 + 1;
            }
            Some(n)
        } else {
            None
        }
    }
}

pub unsafe fn test_loop() {
    for _ in Inclusive::new(1, 7) { 
        f();
    }
}

I don't think this matches the behavior of inclusive ranges 1:1, but it seems that using a current value + counter pair optimizes correctly.

@bugadani
Copy link
Contributor

This problem is not there any more in 1.47 beta: https://godbolt.org/z/eTPETc

@leonardo-m
Copy link

This problem is not there any more in 1.47 beta

It needs a wider testing, with double and triple nested loops too.

@MSxDOS
Copy link
Author

MSxDOS commented Oct 11, 2020

Still very bad with a nested loop: https://godbolt.org/z/djY1YW

@bugadani
Copy link
Contributor

bugadani commented Oct 11, 2020

@ds84182

I don't think this matches the behavior of inclusive ranges 1:1, but it seems that using a current value + counter pair optimizes correctly.

Unfortunately, to do this, the values in the range must be at least PartialOrd, which the current Range API does not require.

The simplest solution I found, while keeping the type signatures the same is this, which actually performs worse than the current impl 🤔

@MSxDOS
Copy link
Author

MSxDOS commented Mar 5, 2021

Unfortunately, the upgrade to LLVM 12 didn't fix the nested loop example, but made it even worse.

cc @nikic Could you have a look at this?

@nikic
Copy link
Contributor

nikic commented Mar 5, 2021

Cleaned up IR test case:

define void @test() {
start:
  br label %loop 

loop:                                            
  %iv = phi i32 [ 1, %start ], [ %iv.next, %loop ]
  %iv.inc = add nsw i32 %iv, 1
  tail call void @f()
  %cmp1 = icmp slt i32 %iv, 2
  %iv.next = select i1 %cmp1, i32 %iv.inc, i32 2
  %cmp2 = icmp slt i32 %iv.next, 3
  %and = and i1 %cmp1, %cmp2
  br i1 %and, label %loop, label %exit

exit:                                              
  ret void
}

declare void @f() 

I believe CVP should be capable of inferring that %cmp2 = true, but it ends up with a range of constantrange<-2147483647, 4>, which is one element too large...

@nikic
Copy link
Contributor

nikic commented Mar 5, 2021

Something like this would do it: https://gist.github.com/nikic/1beaff28e66771d39ec949814428005b

@nikic nikic self-assigned this Mar 5, 2021
@nikic
Copy link
Contributor

nikic commented Mar 6, 2021

Fixed upstream by llvm/llvm-project@a917fb8.

@nikic
Copy link
Contributor

nikic commented Aug 22, 2021

After #87570 the nested loop generates good code as well:

define void @_ZN5test29test_loop17he2725c5c3163608aE() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  tail call void @f()
  tail call void @f()
  tail call void @f()
  tail call void @f()
  tail call void @f()
  tail call void @f()
  ret void
}

Worth noting that inclusive range loop optimization is very hit and miss, but at least this particular case is handled well now.

@nikic nikic closed this as completed Aug 22, 2021
@leonardo-m
Copy link

A problem isn't solved by the new LLVM. Is this worth reopening this issue or something?

fn euler76() -> u32 {
    const N: usize = 100; // Input.
    let mut ways = [0; N + 1];
    ways[0] = 1;
    for j in 1 .. N {
        //for i in j ..= N { // Adds a bound test.
        for i in j .. N + 1 {
            ways[i] += ways[i - j];
        }
    }
    ways[N]
}

fn main() { assert_eq!(euler76(), 190_569_291); }

@berghetti
Copy link

berghetti commented Dec 22, 2023

Problem remains, even for a simple case.
https://godbolt.org/z/K44MTqb7n

@LingMan
Copy link
Contributor

LingMan commented Dec 22, 2023

@berghetti The original example still optimizes well. I don't think reusing this ticket for similar but separate issues will get them much attention, especially years after this one has been closed.

You'd better open a new report.

Btw, (0..=*num).for_each(|_| { i += *num; }); optimizes as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

8 participants