[AArch64]: 128-bit Sequentially Consistent load allows reordering before prior store when armv8 and armv8.4 implementations are Mixed #81978

lukeg101 · 2024-02-16T09:44:07Z

Consider the following litmus test:

C SB

{ int128_t *x = 0; int128_t *y = 0}

P0 (_Atomic __int128 *x, _Atomic __int128 *y) {
  atomic_store_explicit(x,1,memory_order_seq_cst);
  __int128 r0 = atomic_load_explicit(y,memory_order_seq_cst);
}

P1 (_Atomic __int128 *x, _Atomic __int128 *y) {
  atomic_store_explicit(y,1,memory_order_seq_cst);
  __int128 r0 = atomic_load_explicit(x,memory_order_seq_cst);
}

exists (P0:r0=0 /\ P1:r0 = 0)

where P0:r0 = 0 means thread P0, local variable r0 has value 0.

Building either P0/P1 for v8-a or v8.4 passes as expected. When simulating this test under the C/C++ model from its initial state, the outcome of execution in the exists clause is forbidden by the source model. The allowed outcomes are:

{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }

However when compiling P0, to target armv8.4-a (https://godbolt.org/z/dxTrbGxoG) using clang trunk (dmb ish; stp; dmb ish; ldp; dmb ish), compiling the store on P1 to target armv8.0-a using clang (ldaxp;stlxp;cbnz loop), and the load on P1 to target armv8.4-a (ldp;dmb ish) using clang. When compiled the assembly is as follows:

P0:
  MOV X6, #1
  DMB ISH
  STP X5, X6 [reg_containing_x]
  DMB ISH
  LDP X2, X1, [reg_containing_y] 
  DMB ISH
 
P1:
loop:
  LDAXP X5, X6, [reg_containing_y]
  STLXP W4, X5, X6, [reg_containing_y]
  CBNZ W4, loop
  LDP X2, X1, [reg_containing_x]
  DMB ISH

The compiled program has the following outcomes when simulated under the AArch64 model (rename P0:X1 to P0:r0 and P1:X1 to P1:r0 to match with source outcomes):

{ P0:r0=0; P1:r0=0; } <--- Forbidden by source model, bug!
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }

which is due to the fact the effects of LDP on P1 can be reordered before the effects of STLXP on P1 since there is no leading DMB barrier to prevent the reordering.

Since there is no barrier, we propose to fix the bug by adding said barrier before LDP:

DMB ISH; LDP; DMB ISH

Which prevents the buggy outcome under the AArch64 memory model.

Besides using a DMB, it is feasible to use LDAR - making the SC LDP stronger so it no longer reorders with STLXP. Note it is also possible to make STLXP stronger by adding a DMB after the loop (but that's not what is done for other atomic sizes)

I have validated this bug whilst discussing with Wilco from Arm's compiler teams.

This bug would not have been caught in normal execution, but only when multiple implementations are mixed together.

The text was updated successfully, but these errors were encountered:

llvmbot · 2024-02-16T09:44:47Z

@llvm/issue-subscribers-backend-aarch64

Author: Luke Geeson (lukeg101)

Consider the following litmus test: ``` C SB

{ int128_t *x = 0; int128_t *y = 0}

P0 (_Atomic __int128 *x, _Atomic __int128 *y) {
atomic_store_explicit(x,1,memory_order_seq_cst);
__int128 r0 = atomic_load_explicit(y,memory_order_seq_cst);
}

P1 (_Atomic __int128 *x, _Atomic __int128 *y) {
atomic_store_explicit(y,1,memory_order_seq_cst);
__int128 r0 = atomic_load_explicit(x,memory_order_seq_cst);
}

exists (P0:r0=0 /\ P1:r0 = 0)

where `P0:r0 = 0` means thread `P0`, local variable `r0` has value `0`.

Building either P0/P1 for v8-a or v8.4 passes as expected. When simulating this test under the C/C++ model from its initial state, the outcome of execution in the exists clause is forbidden by the source model. The allowed outcomes are:

{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }


However when compiling `P0`, to target armv8.4-a (https://godbolt.org/z/dxTrbGxoG) using clang trunk (`dmb ish; stp; dmb ish; ldp; dmb ish`), compiling the `store` on `P1` to target armv8.0-a using clang (`ldaxp;stlxp;cbnz` loop), and the `load` on `P1` to target armv8.4-a (`ldp;dmb ish`) using clang. When compiled the assembly is as follows:

P0:
MOV X6, #1
DMB ISH
STP X5, X6 [reg_containing_x]
DMB ISH
LDP X2, X1, [reg_containing_y]
DMB ISH

P1:
loop:
LDAXP X5, X6, [reg_containing_y]
STLXP W4, X5, X6, [reg_containing_y]
CBNZ W4, loop
LDP X2, X1, [reg_containing_x]
DMB ISH


The compiled program has the following outcomes when simulated under the AArch64 model (rename P0:X1 to P0:r0 and P1:X1 to P1:r0 to match with source outcomes):

{ P0:r0=0; P1:r0=0; } <--- Forbidden by source model, bug!
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }

which is due to the fact the effects of `LDP` on `P1` can be reordered before the effects of `STLXP` on `P1` since there is no leading `DMB` barrier to prevent the reordering.

Since there is no barrier, we propose to fix the bug by adding said barrier before `LDP`:

DMB ISH; LDP; DMB ISH

Which prevents the buggy outcome under the AArch64 memory model.

Besides using a `DMB`, it is feasible to use `LDAR` - making the SC `LDP` stronger so it no longer reorders with `STLXP.` Note it is also possible to make `STLXP` stronger by adding a `DMB` after the loop (but that's not what is done for other atomic sizes)

I have validated this bug whilst discussing with Wilco from Arm's compiler teams.

This bug would not have been caught in normal execution, but only when multiple implementations are mixed together.



</details>

lukeg101 · 2024-02-16T09:52:55Z

@tmatheson-arm @Wilco1 @efriedma-quic

Eli you asked this question (it is archived so I cannot comment on the phab link) https://reviews.llvm.org/D141429#inline-1378324:

can someone at ARM verify if this is the sequence we actually want for sequentially consistent loads with lse2, as opposed to using caspal?

First of all, the views of the authors expressed in this comment are not endorsed by Arm or any other company mentioned. I do not represent Arm.

This is another one of those compiler bugs that arise when we mix compatible implementations of atomics. Wilco and I have been looking into whether concurrency bugs arise when implementations are mixed. We actually found the above when looking back at this Phabricator link.

It turns out there are compiler bugs that non-mixed testing (ie prior work) can miss. When validating that replacing CASPAL with LDP;DMB ISH was a valid replacement I could not find a litmus test at the time (given one set of atomics mappings) that exhibited a compiler bug. This was a best-effort at the time as I had not considered mixing implementations. This shows the difficulty of testing the compilation of concurrency as a problem that cannot be simply addressed by testing atomics mappings in isolation.

I've developed a tool to automatically search for these kinds of 'mixing bugs', given a set of tests and compiler profiles that generate atomics, as input.

lukeg101 · 2024-02-16T10:48:08Z

The following AArch64 litmus test was generated by the tool (and tweaked by me). It can be fed into the memory model tool:

AArch64 SB6

{ int64_t x[2] = {0,0}; int64_t y[2] = {0,0};
  0:X1 = 0; 0:X2 = 1; 0:X3 = x; 
  0:X0 = 0; 0:X4 = 0; 0:X5 = y;
  1:X1 = 0; 1:X2 = 0; 1:X3 = y;
  1:X5 = 0; 1:X6 = 1; 1:X8 = x; }

P0                 | P1                      ;
  DMB ISH          | loop: LDAXP X1, X2, [X3];
  STP X1, X2, [X3] |   STLXP W4, X5, X6, [X3];
  DMB ISH          |   CBNZ W4, loop         ; 
  LDP X4, X0, [X5] |  LDP X4, X0, [X8]       ;
  DMB ISH          |  DMB ISH                ; 

exists (0:X0 = 0 /\ 1:X0 = 0)

Wilco1 · 2024-02-16T12:02:07Z

In GCC we use this:

ldar	tmp0, [x0]	/* Block reordering with Store-Release instr.  */
ldp	res0, res1, [x0]
dmb	ishld

This allows the most reordering while still being correct.

lukeg101 · 2024-03-13T11:04:51Z

Hi @efriedma-quic are you aware of other cases where mixing has caused issues? I'd like to understand a bit more about the Windows case and how it was used, and other cases if you know any?

github-actions bot added the new issue label Feb 16, 2024

lukeg101 added backend:AArch64 and removed new issue labels Feb 16, 2024

taiki-e mentioned this issue Jun 3, 2024

Tracking issue for integer_atomics rust-lang/rust#99069

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AArch64]: 128-bit Sequentially Consistent load allows reordering before prior store when armv8 and armv8.4 implementations are Mixed #81978

[AArch64]: 128-bit Sequentially Consistent load allows reordering before prior store when armv8 and armv8.4 implementations are Mixed #81978

lukeg101 commented Feb 16, 2024

llvmbot commented Feb 16, 2024

lukeg101 commented Feb 16, 2024 •

edited

Loading

lukeg101 commented Feb 16, 2024

Wilco1 commented Feb 16, 2024

lukeg101 commented Mar 13, 2024

[AArch64]: 128-bit Sequentially Consistent load allows reordering before prior store when armv8 and armv8.4 implementations are Mixed #81978

[AArch64]: 128-bit Sequentially Consistent load allows reordering before prior store when armv8 and armv8.4 implementations are Mixed #81978

Comments

lukeg101 commented Feb 16, 2024

llvmbot commented Feb 16, 2024

lukeg101 commented Feb 16, 2024 • edited Loading

lukeg101 commented Feb 16, 2024

Wilco1 commented Feb 16, 2024

lukeg101 commented Mar 13, 2024

lukeg101 commented Feb 16, 2024 •

edited

Loading