+++ title = "CAS, ABA and LL/SC" [taxonomies] tags = ["threading", "linux", "arm", "x86", "qemu"] +++ If this zoo of abbreviations does not mean anything to you then "[fly you fools][fly-you-fools]" as long as you still can :^) Jokes aside. I recently had the opportunity to ramp up a new co-worker on the different semantics of `atomic` instructions (compare and swap `CAS` vs load-linked store-conditional `LL/SC`), `lock-free` programming and the `ABA` problem. In the realm of `lock-free` programming and the `ABA` problem, I shared a story from some years ago when I was debugging a random memory corruption in our multi-threaded simulator. Sharing that story actually gave the idea to write this post as reference for potential new co-workers or all the internet people <3. After many hours of debugging the random memory corruption, which *obviously* did not occur very often, turned out to be a bug in a custom memory allocator. To be more precise, in a primitive used by the allocator to manage a list of free memory blocks. The allocator was used from multiple threads concurrently and hence `pushing` and `popping` free memory blocks had to be thread-safe. The operations to manage the free block list were implemented with an `lock-free` algorithm, which we will visit later when analyzing the bug. Before, let us revisit the fundamental semantics of atomic `compare-and-swap (CAS)` operations, such that everyone has the same understanding. If you are already familiar with CAS, feel free to skip ahead to the exploration of the [lock-free memory block primitive](#lock-free-free-memory-block-primitive). ## compare-and-swap (CAS) The following pseudo code gives a good mental model for an atomic CAS operation. The described compare-and-swap is executed atomically without giving any other execution stream the chance to interfere in the middle of this operation. ```cpp // bool cas(T* val, T* expected, T desired); // // Does the following operation atomically. // bool is_expected = (*val == *expected); if (is_expected) { *val = desired; } else { *expected = *val; } return is_expected; ``` Below are some concrete examples using the cpp [`std::atomic`][cpp-std-atomic] type. ```cpp // bool std::atomic::compare_exchange_weak(T& expected, T desired); std::atomic val(42); int expected = 43; // val != expected -> expected = val assert(val.compare_exchange_weak(expected, 1337) == false); // val == expected -> val = desired assert(val.compare_exchange_weak(expected, 1337) == true); assert(val.load() == 1337); ``` When talking about atomics we also need to talk about [memory ordering][cpp-mem-order]. Memory ordering concerns the ordering of memory accesses, both atomic and non-atomic, in relation to atomic operations. It also determines how side-effects, such as unrelated writes for example, are observed across threads. Essentially, memory ordering sets the rules for optimizers to reorder instructions and dictates compilers to insert memory barrier instructions, ensuring the memory ordering guarantees, in cases where the underlying hardware's memory model is relatively weak and subject to instruction reordering. For the purpose of our discussion, we'll focus on the concept of `sequential consistent` ordering, as it is the most intuitive and also the default in C++ when utilizing std::atomic. When dealing with atomic loads and stores, it provides the following guarantees: 1. `atomic store`: On the current thread no read / write preceding the *store* can be reordered after the *store*. 
An atomic CAS operation performs an atomic load and store, which means that no
read or write can be reordered across the CAS operation in either direction.

With that we have just scratched the surface of the rabbit hole, but it is
sufficient for the remaining discussion. The 2017 cppcon talk from
[Fedor Pikus "C++ atomics, from basic to advanced. What do they really
do?"][yt-atomics] goes into more depth and is highly recommended. I also added
some more [references](#references) at the end of the post.

## `lock-free` free memory block primitive

The code below gives the essence of the primitive to manage the free memory
blocks with the goal of being thread-safe. Essentially it is a linked list of
`BlockStack::Block` objects where the head pointer to the beginning of the list
is swapped atomically. Details such as how the list is initially filled up or
how it grows dynamically at runtime are left out here. The same goes for any
sort of error handling, as none of this is relevant for the discussion.

```cpp
struct BlockStack {
  struct Block {
    Block* next{nullptr};
    // ..
  };

  void push(Block* blk) {
    do {
      blk->next = m_head;
    } while (!m_head.compare_exchange_weak(blk->next, blk));
  }

  Block* pop() {
    Block* blk;
    do {
      blk = m_head;
      // eg, if (blk == nullptr) expand with new free Blocks.
    } while (!m_head.compare_exchange_weak(blk, blk->next));
    return blk;
  }

private:
  std::atomic<Block*> m_head{nullptr};
};
```

> The `std::atomic::compare_exchange_weak` overload used has an additional
> argument for the memory ordering with a default value of sequentially
> consistent. This ensures in the push() method that, when another thread
> observes the new head pointer, the write to *blk->next* before the CAS is
> also visible.

Taking a first look at the `push()` and `pop()` methods, the implementation
looks quite sound.

`push()` inserts the `blk` object at the beginning of the list. That means the
newly inserted block needs to point to the current list head and the list head
pointer must be updated to point to the newly inserted block. The head pointer
is swapped atomically, which only succeeds if no one else has updated the head
pointer in the meantime and it still points to `blk->next`. In case the CAS was
not successful, the insert operation is retried in a loop until the block is
inserted.

`pop()` on the other hand removes the block at the beginning of the list and
sets the second block as the new beginning of the list. This is done by
updating the head pointer to point to the second block entry in the list.
Similar to push(), the current head is read once and the head pointer is
swapped atomically. This operation is again repeated in a loop until the CAS is
successful. However, the `pop()` method has a subtle difference compared to the
`push()` method, as it dereferences the current head block, which makes it
vulnerable to the `ABA problem`.
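For contrast, before dissecting the bug: a lock-based variant of the same
primitive (a minimal sketch using `std::mutex`, not the allocator's original
code) would sidestep the whole issue, at the cost of taking a lock on every
push and pop. We will come back to this trade-off later.

```cpp
#include <mutex>

struct LockedBlockStack {
  struct Block {
    Block* next{nullptr};
    // ..
  };

  void push(Block* blk) {
    std::lock_guard<std::mutex> lock(m_lock);
    blk->next = m_head;
    m_head = blk;
  }

  Block* pop() {
    std::lock_guard<std::mutex> lock(m_lock);
    Block* blk = m_head;
    if (blk != nullptr) {
      m_head = blk->next;
    }
    return blk;
  }

private:
  std::mutex m_lock;
  Block* m_head{nullptr};
};
```

Since both methods mutate the list under the same mutex, no other thread can
ever observe a half-updated head pointer.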
We will see in a moment what the ABA problem is and why it is called that, but
first let us slightly rewrite the pop() method to make it easier to spot the
bug.

```cpp, linenos
Block* pop() {
  Block* blk;
  Block* next;
  do {
    blk = m_head;
    next = blk->next;  // Deref "our" head block.
  } while (!m_head.compare_exchange_weak(blk, next));
  return blk;
}
```

Now let us assume we have two threads *Thread 1* and *Thread 2*. Both threads
access the shared `BlockStack` instance which currently contains three free
blocks (A), (B) and (C).

*Thread 1* calls pop() and gets interrupted between line 6 and 7.
- `blk` points to (A)
- `next` points to (B)

In the meantime *Thread 2* performs the following actions
- pop() -> (A)
- pop() -> (B)
- push((A))

After this, *Thread 2* owns block (B) and the `BlockStack` instance contains
the free blocks (A) and (C).

Next, *Thread 1* is resumed and continues with the atomic swap of `m_head`. The
swap will succeed as `m_head (A) == blk (A)` is true, however this operation
will re-insert block (B) as a free block, since *Thread 1* read `next = (B)`
before being interrupted.

This leaves us with the following state after *Thread 1* returns from the pop
operation: `m_head` now points to block (B). We can easily see that this state
is catastrophic, as the next pop operation will return block (B), which is
currently used and owned by *Thread 2*. Additionally, block (C) and the rest of
the list are leaked.

The example above describes the `ABA problem`: the "inner" state of the block
list is altered, by removing (B), but the observable "outer" state is
unchanged, by removing and re-inserting (A) again, hence ABA.

The following program [cas-aba-issue.cc](cas-aba-issue.cc) demonstrates the
above described issue and can be used for further experiments. It can be
compiled as follows.

```
clang++ -o cas-aba-issue cas-aba-issue.cc -latomic
```

So, how can the current implementation be fixed? The obvious approach would be
to rewrite the push() and pop() methods to use a `lock`. However, the point of
this post is to look at `lock-free` approaches and learn about the different
atomic instructions. Therefore, we will discuss two approaches which require
specific instructions and hence are only applicable if the underlying hardware
supports those instructions.
1. Generational pointers using a `double-word CAS`.
2. `Exclusive` accesses using load-linked store-conditional.

## Generational pointers

In this approach the raw head pointer is replaced with a head object that holds
the raw pointer as well as a counter (the `generation`). The idea being,
whenever the head object is swapped, the generation is increased. In practice,
this makes it resilient against the ambiguous raw pointer in the ABA problem,
as in the example discussed above the head objects would not compare equal due
to a different generation.

> Theoretically, a case can be constructed where the ABA problem occurs if the
> counter overflows, but that is not really relevant in practice.

For the following discussion we assume that we are in an x86_64 environment.
The `GenerationPtr` will be our new head object, holding the raw pointer as
well as the generation counter.

```cpp
// alignas(16): 16 byte alignment is required by the 16 byte CAS used below.
struct alignas(16) GenerationPtr {
  Block*   ptr{nullptr};
  uint64_t gen{0};
};
```

To be able to atomically compare and swap such an object, the host must support
a double-word CAS, in this specific case a 16 byte CAS. Most x86_64 hosts
support the [`cmpxchg16b`][cmpxchg16b] instruction, which allows to atomically
compare and swap 16 bytes when combined with the `lock` prefix.
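As an aside, the portable route would be to wrap the head object in a
`std::atomic` and let the compiler and `libatomic` handle the 16-byte CAS.
Whether that actually lowers to an inline `lock cmpxchg16b`, and whether
`is_lock_free()` reports true, depends on the compiler version and on `-mcx16`,
which is exactly why we roll our own helper next. A minimal sketch (my own
illustration, reusing the `GenerationPtr` definition from above; build with
`-latomic` and ideally `-mcx16`):

```cpp
#include <atomic>
#include <cstdint>

struct Block {
  Block* next{nullptr};
};

struct alignas(16) GenerationPtr {
  Block*   ptr{nullptr};
  uint64_t gen{0};
};

std::atomic<GenerationPtr> head{GenerationPtr{}};

// Bump the generation without changing the pointer. compare_exchange_weak is
// allowed to fail spuriously, hence the retry loop; on failure `old_head` is
// reloaded with the current value of `head`.
void bump_generation() {
  GenerationPtr old_head = head.load();
  GenerationPtr new_head;
  do {
    new_head.ptr = old_head.ptr;
    new_head.gen = old_head.gen + 1;
  } while (!head.compare_exchange_weak(old_head, new_head));
}
```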
We create ourselves a helper function for the 16 byte CAS based on the legacy
`__sync_bool_compare_and_swap` builtin, as both gcc and clang directly emitted
cmpxchg16b instructions instead of library calls with the `-mcx16` target
option enabled. Which builtin we use is irrelevant for this discussion, and
this one makes looking at the disassembly easier.

```cpp
// bool __sync_bool_compare_and_swap(type* ptr, type oldval, type newval);

template <typename T>
bool cmpxchg16b(T* dest, T oldval, T newval) {
  static_assert(sizeof(T) == 16, "Required for cmpxchg16b");
  static_assert(alignof(T) == 16, "Required for cmpxchg16b");
  const auto as_u128 = [](T* ptr) {
    return reinterpret_cast<__uint128_t*>(ptr);
  };
  return __sync_bool_compare_and_swap(as_u128(dest), *as_u128(&oldval),
                                      *as_u128(&newval));
}
```

The codegen with clang 15.0.7 for this function gives the following result on
my machine.

```asm
; generated by clang version 15.0.7 with -mcx16

00000000000016e0 <bool cmpxchg16b<BlockStack::GenerationPtr>(BlockStack::GenerationPtr* dest, BlockStack::GenerationPtr oldval, BlockStack::GenerationPtr newval)>:
; Per SystemV ABI
;   rdi = dest ptr
;   rsi = oldval.ptr
;   rdx = oldval.gen
;   rcx = newval.ptr
;   r8  = newval.gen
    16e0: push   rbx
    16e1: mov    rbx,rcx   ; shuffle args for cmpxchg16b
    16e4: mov    rax,rsi   ; shuffle args for cmpxchg16b
    16e7: mov    rcx,r8    ; shuffle args for cmpxchg16b
; CMPXCHG16B
;   compare rdx:rax (oldval) with [rdi] (dest*)
;   if equal then load rcx:rbx (newval) into [rdi] (dest*)
    16ea: lock cmpxchg16b OWORD PTR [rdi]
    16ef: sete   al
    16f2: pop    rbx
    16f3: ret
```

With that we can update the type of `BlockStack::m_head` to `GenerationPtr` and
adapt the push() and pop() methods.

```cpp, linenos
// BlockStack ..

void push(Block* blk) {
  GenerationPtr old_head, new_head;
  do {
    old_head = m_head;
    blk->next = old_head.ptr;
    new_head.ptr = blk;
    new_head.gen = old_head.gen + 1;
  } while (!cmpxchg16b(&m_head, old_head, new_head));
}

Block* pop() {
  GenerationPtr old_head, new_head;
  do {
    old_head = m_head;
    new_head.ptr = old_head.ptr->next;
    new_head.gen = old_head.gen + 1;
  } while (!cmpxchg16b(&m_head, old_head, new_head));
  return old_head.ptr;
}
```

> For the purpose of this discussion we assume that the
> __sync_bool_compare_and_swap builtin acts as a full barrier. In practice the
> __sync builtins should not be used anymore; instead use the newer __atomic
> builtins, which allow to specify the memory ordering.

A note about the loads of `m_head` in line 6 and line 16: those loads are not
atomic, and the value of `m_head` could be updated from another thread in the
middle of loading the values. However, if we read an inconsistent `m_head`, the
CAS will fail and we will retry the operation.

The following program [cas-ptr-generation.cc](cas-ptr-generation.cc) provides
the full example. When being run it demonstrates how the _generation pointer_
prevents the ABA problem. It can be compiled as follows.

```
clang++ -o cas-ptr-generation cas-ptr-generation.cc -latomic -mcx16
```

## Load linked store conditional

So far we looked at atomic CAS instructions, which have read-modify-write
semantics and are typically found in CISC ISAs. RISC ISAs on the other hand
usually provide load-linked (LL) store-conditional (SC) instruction pairs to
implement atomic accesses. Compared to the compare-and-swap instruction, LL/SC
are two distinct instructions. The LL instruction loads the current value and
marks the memory region for exclusive access, and the SC instruction only
succeeds and updates the memory location if the memory location is still marked
as exclusive at the time of the store-conditional.
If there was an update to the memory location between the LL/SC pair, the
exclusive marker is removed. The pseudo code below gives an example of how
atomic operations are typically implemented with LL/SC instructions. The
example shows an atomic increment of a location in memory.

```c
/// Atomically increment *MEM and return OLD value.
uint64_t atomic_fetch_inc(uint64_t* mem) {
  uint64_t old;
  bool ok;

  do {
    old = ll(mem);          // Load current value and mark for ex access.
    ok  = sc(mem, old + 1); // Try to store incremented value.
  } while (!ok);

  return old;
}
```

For the remaining discussion we will focus on the ARM architecture and use the
[A64 instruction set][a64-insn], as I have some ARM silicon at home to run the
examples on. We can take a look into the
[ARM ARM (Architecture Reference Manual)][arm-arm-a] for the A-profile to find
the description of the LL/SC instructions in chapter *B2.9 Synchronization and
semaphores*, where `Load-Exclusive` refers to load-linked and `Store-Exclusive`
refers to store-conditional.

> The model for the use of a Load-Exclusive/Store-Exclusive instruction pair
> accessing a non-aborting memory address x is:
> - The Load-Exclusive instruction reads a value from memory address x.
> - The corresponding Store-Exclusive instruction succeeds in writing back to
>   memory address x only if no other observer, process, or thread has
>   performed a more recent store to address x. The Store-Exclusive instruction
>   returns a status bit that indicates whether the Write Memory succeeded.

The A64 instruction set supports multiple LL/SC instruction pairs for different
operand sizes and with different memory ordering semantics. We will make use of
the [LDXR][a64-ldxr] and [STXR][a64-stxr] instructions, which are the exclusive
load register and store register instructions without any memory ordering
semantics. We will use explicit `memory barriers` rather than instruction pairs
with `release-acquire` memory ordering semantics, as this post is not aimed at
discussing the different memory orderings, but at giving an overview of the
different atomic instructions. Table *B2-3 Synchronization primitives and
associated instruction, A64 instruction set* in the ARM ARM gives a full
overview of all the available LL/SC instruction pairs in the A64 instruction
set.

The following code gives a starting point to play with the LL/SC instructions
on ARM, by providing inline assembly wrapper functions to emit the LDXR and
STXR instructions.

```cpp
#include <cstdint>
#include <cstdio>
#include <thread>

/// Load exclusive register.
inline uint64_t ldxr(uint64_t* addr) {
  uint64_t ret;
  asm volatile("ldxr %0, %1" : "=r"(ret) : "Q"(*addr) : "memory");
  return ret;
}

/// Store exclusive register.
inline bool stxr(uint64_t* addr, uint64_t val) {
  // %w0 receives the 32-bit status result: 0 == success, 1 == lost exclusivity.
  uint32_t ret;
  asm volatile("stxr %w0, %2, %1"
               : "=&r"(ret), "=Q"(*addr)
               : "r"(val)
               : "memory");
  return ret == 0;
}

int main() {
  uint64_t mem = 42;

  auto T1 = std::thread([&mem]() {
    // Write to exclusive location (does clear exclusive monitor).
    mem = 2222;
    // Full memory barrier.
    __sync_synchronize();
  });

  uint64_t old = ldxr(&mem);

  // Some artificial delay w/o an explicit context switch (eg syscall) as that
  // would clear the exclusive monitor, though it can still be interrupted by
  // the scheduler.
  // Delay is "tuned" for my ARM device.
  for (int i = 0; i < (1 << 14); ++i) {
    asm volatile("nop");
  }

  // Full memory barrier.
  __sync_synchronize();

  bool ok = stxr(&mem, 1111);
  printf("old: %lu -> mem: %lu | ok: %d\n", old, mem, ok);

  T1.join();
  return ok ? 0 : 1;
}
```
The full source code with additional comments is available here:
[a64-basic-llsc.cc](a64-basic-llsc.cc). It can be compiled and run natively on
an ARM system as shown below. We run the program in a loop until we hit the
case where T1 wins the race and does the memory store in between the LDXR/STXR
pair on the main thread.

```
> g++ -o a64-basic-llsc a64-basic-llsc.cc -lpthread
> while ./a64-basic-llsc; do : ; done
...
old: 42 -> mem: 1111 | ok: 1
old: 42 -> mem: 1111 | ok: 1
old: 42 -> mem: 1111 | ok: 1
old: 42 -> mem: 2222 | ok: 0
```

With those basics covered, we can also implement the LL/SC version of the
`BlockStack`. The push() and pop() methods now leverage the exclusive load and
store instructions to atomically update the `m_head` pointer.

```cpp
struct BlockStack {
  struct Block {
    Block* next{nullptr};
    // ..
  };

  void push(Block* blk) {
    do {
      blk->next = load_head_ex();
    } while (!store_head_ex(blk));
  }

  Block* pop() {
    Block *blk, *next;
    do {
      blk = load_head_ex();
      next = blk->next;
    } while (!store_head_ex(next));
    return blk;
  }

private:
  // Assume 64bit ptr size for ldxr/stxr instructions.
  static_assert(sizeof(void*) == sizeof(uint64_t), "Ptr size mismatch!");

  Block* load_head_ex() const {
    Block* blk = (Block*)ldxr((uint64_t*)&m_head);
    mem_barrier();
    return blk;
  }

  bool store_head_ex(Block* blk) {
    const bool success = stxr((uint64_t*)&m_head, (uint64_t)blk);
    mem_barrier();
    return success;
  }

  Block* m_head{nullptr};
};
```

The full version of this implementation with additional comments as explanation
is available here: [a64-llsc.cc](a64-llsc.cc). It can be compiled and run as
follows on an ARM system. Depending on the timing, the example may need to be
run multiple times until the assertions pass, as all the artificial delays
hacked into the implementation are tuned to my ARM device.

```
g++ -o a64-llsc a64-llsc.cc -lpthread
./a64-llsc
```

In fact, passing the assertions in the example above does not actually
guarantee that we drove the two threads and the `BlockStack` object into
exactly the same state as in the CAS example that triggered the ABA problem. In
the CAS example we could use a _sleep()_ to control the exact order of events,
but here we can use neither _sleep()s_ nor _atomics_ to synchronize the
threads' execution, as either would most certainly clear the exclusive monitor
set by the _ldxr_ instruction that read the head pointer. All we can do is add
some _nop_ delays, hope the exclusive monitor is not cleared by any of the
artificial hacks we introduced (or by an interruption from the scheduler), pray
the order of events is exactly as we want it to be, and feel happy when the
assertions pass.

The sources of the example are good for explaining how the stronger guarantees
of the LL/SC instructions prevent the ABA problem. But in practice, it is
really hard to prove that the program ran exactly as we wanted it to :^)

## The END

Personally, I would say that in 99% of the cases, nobody will have to write
code as we did in this post. Of course, there are always exceptions. Always
prefer to write portable and generic algorithms, and start with a lock rather
than a lock-free algorithm. Most important: measure and profile your code, do
not optimize prematurely, and do not use "cool" stuff just for the sake of
using it. Use the right tool for the job and seek the simplest solution.

However, in this post we specifically wanted to write code like this, to
introduce CAS vs LL/SC atomics and the ABA problem in multi-threaded lock-free
algorithms.
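To close with the portability point: the very same portable C++ maps onto both
instruction families, it is the target that decides. A minimal sketch (my own
illustration; the instruction mappings in the comment are what gcc/clang
typically emit, so check the disassembly of your own build):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> value{0};

// One portable compare-and-swap:
//  - x86_64:                typically a single `lock cmpxchg`
//  - AArch64 (no LSE):      typically a load/store-exclusive retry loop
//  - AArch64 (ARMv8.1 LSE): typically a single CAS instruction
bool swap_if_equal(uint64_t expected, uint64_t desired) {
  return value.compare_exchange_strong(expected, desired);
}
```

Which is the practical takeaway: let `std::atomic` pick the instruction, and
only reach for the raw primitives when you really have to.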
Any corrections, improvements or comments are happily welcome; for that, find
my mail on [memzero.de][memzero].

## Appendix: QEMU ARM user emulation

Unfortunately, we can not use the `qemu-aarch64` userspace emulator for the
experiments here, as for the ARM guest target QEMU implements the load / store
exclusive instructions not with the full architectural semantics, but with a
simplification which seems sufficient for typical guest images.

When QEMU emulates a _load exclusive_ instruction, it saves the guest address
as well as the data loaded from the guest memory in the cpu state
[CPUARMState::{exclusive_addr, exclusive_val}][qemu-a64-cpu-state]. On a _store
exclusive_, the guest address of the store is compared with the saved
_exclusive_addr_ and the current data at that address with the saved
_exclusive_val_. Only if both compare equal is the _store exclusive_ carried
out successfully. The following pseudo code, taken from a comment in
[QEMU:translate-a64.c][qemu-a64-ex], shows how QEMU emits _store exclusive_
instructions, where `env` refers to the cpu state.

```c
/// gen_store_exclusive(..)

if (env->exclusive_addr == addr && env->exclusive_val == [addr]) {
  [addr] = {Rt};
  {Rd} = 0;
} else {
  {Rd} = 1;
}
env->exclusive_addr = -1;
```

The *exclusive_val* and *exclusive_addr* fields act as the _exclusive monitor_.
However, they are only maintained by load / store exclusive instructions and
not by normal store instructions, and this is where QEMU weakens the semantics
of the exclusive instructions. A sequence of normal write instructions can
invalidate and then restore the state of the "_exclusive monitor_" in the QEMU
cpu state.

With everything learned about the ABA problem in this post, one can now
understand why QEMU is a bad idea for running our experiments: QEMU will just
trigger the ABA problem, even with our [a64-llsc.cc](a64-llsc.cc) example using
load / store exclusive instructions.

## Abbreviations

- CAS: compare-and-swap
- LL/SC: load-linked store-conditional
- CISC: complex instruction set computer
- RISC: reduced instruction set computer

## Source files

- [cas-aba-issue.cc](cas-aba-issue.cc)
- [cas-ptr-generation.cc](cas-ptr-generation.cc)
- [a64-basic-llsc.cc](a64-basic-llsc.cc)
- [a64-llsc.cc](a64-llsc.cc)

## References

- [std::atomic][cpp-std-atomic]
- [cpp memory ordering][cpp-mem-order]
- [Fedor Pikus "C++ atomics, from basic to advanced. What do they really do?"][yt-atomics]
- [Memory Consistency Models: A Tutorial][mem-order-tutorial]
- [gcc __sync builtins][gcc-sync]
- [gcc inline asm: output constraints][gcc-asm-output]
- [gcc inline asm: machine constraints][gcc-asm-machine]
- [x86_64: cmpxchg16b instruction][cmpxchg16b]
- [ARM architecture reference manual A-profile][arm-arm-a]
- [ARM exclusive monitor][a64-ex-monitor]
- [ARM A64: instruction manual][a64-insn]
- [ARM A64: stxr instruction][a64-stxr]
- [ARM A64: ldxr instruction][a64-ldxr]
- [arm64: atomics: fix use of acquire + release for full barrier semantics][linux-a64-patch]
- [QEMU A64 load / store exclusive][qemu-a64-ex]
- [QEMU A64 CPUARMState::{exclusive_addr, exclusive_val}][qemu-a64-cpu-state]

[fly-you-fools]: https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExMWJpdjFmMGlmdjcyaWZqM21yaXlzb3JlamN6NGJjZmwzdHkxOWlieCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/74ExXfqsbJ8l2/giphy.gif
[yt-atomics]: https://youtu.be/ZQFzMfHIxng?t=2596
[cpp-mem-order]: https://en.cppreference.com/w/cpp/atomic/memory_order
[cpp-std-atomic]: https://en.cppreference.com/w/cpp/atomic/atomic
[gcc-sync]: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
[gcc-asm-machine]: https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html
[gcc-asm-output]: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Output-Operands
[cmpxchg16b]: https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b
[a64-insn]: https://developer.arm.com/documentation/ddi0602/latest/?lang=en
[a64-stxr]: https://developer.arm.com/documentation/ddi0596/latest/Base-Instructions/STXR--Store-Exclusive-Register-?lang=en
[a64-ldxr]: https://developer.arm.com/documentation/ddi0596/latest/Base-Instructions/LDXR--Load-Exclusive-Register-?lang=en
[a64-ex-monitor]: https://developer.arm.com/documentation/100934/0100/Exclusive-monitors?lang=en
[arm-arm-a]: https://developer.arm.com/documentation/ddi0487/latest/
[memzero]: https://memzero.de
[qemu-a64-ex]: https://elixir.bootlin.com/qemu/v8.1.2/source/target/arm/tcg/translate-a64.c#L2411
[qemu-a64-cpu-state]: https://elixir.bootlin.com/qemu/v8.1.2/source/target/arm/cpu.h#L687
[mem-order-tutorial]: https://www.cs.utexas.edu/~bornholt/post/memory-models.html
[linux-a64-patch]: https://patchwork.kernel.org/project/linux-arm-kernel/patch/1391516953-14541-1-git-send-email-will.deacon@arm.com/