Atomic operations cheatsheet

What does the processor do?

Load/store reordering

In very broad strokes: Some examples:

AMD x64, Vol. 2, section 7.2; ARMv8-A, section B2.3.12; ARMv8-A, section K13; A Tutorial Introduction to the ARM and POWER Relaxed Memory Models.

Barriers

On x64, these barriers are available (but they are rarely needed, and I haven't found anyone posting an example where anything other than MFENCE is needed):

On ARM, these barriers are available. Typically, only DMB is needed in user-space code.

AMD x64, Vol. 1, section 3.9.2; ARM: Barrier Litmus Tests and Cookbook; Memory barriers are like source control operations; Who ordered memory fences on x86?
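As a rough sketch of where a full barrier matters, consider two threads that each announce themselves with a store and then check the other side with a load. Store-to-load reordering is the case where a full fence is needed. The flag and function names below are hypothetical, and the instruction mapping is an assumption about typical compilers: a sequentially consistent fence usually becomes MFENCE (or an equivalent locked instruction) on x64 and DMB on ARM.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flags for a two-thread "announce, then check the other" handshake. */
static atomic_int flag0;
static atomic_int flag1;

/* Thread 0: announce, then check whether thread 1 has announced as well. */
bool thread0_try_enter(void)
{
    atomic_store_explicit(&flag0, 1, memory_order_relaxed);
    /* Full barrier: the store above may not be reordered after the load below. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag1, memory_order_relaxed) == 0;
}

/* Thread 1: the mirror image of thread 0. */
bool thread1_try_enter(void)
{
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag0, memory_order_relaxed) == 0;
}
```

With both fences in place, at most one of the two functions can return true. Without them, both stores could still be sitting in store buffers while both loads read 0.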

Single copy atomicity

AMD x64, Vol. 2, section 7.3.2; ARMv8-A, section B2.2.

Special CPU instructions

On x64, the LOCK prefix allows some load-modify-store instructions to be performed atomically. This includes ADD, SUB, INC, DEC, AND, OR, CMPXCHG, and CMPXCHG16B. The XCHG instruction implicitly has a LOCK prefix if it accesses memory.
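To make the XCHG case concrete, here is a minimal test-and-set spinlock sketch written with C11 atomics. The type and function names (demo_spinlock, demo_try_lock, demo_lock, demo_unlock) are made up for illustration. It is an assumption about typical compilers, not a guarantee, that the atomic exchange is emitted as XCHG on x64.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinlock type: 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} demo_spinlock;

/* Try once to take the lock; returns true if we now own it. */
static inline bool demo_try_lock(demo_spinlock *l)
{
    /* Atomically write 1 and fetch the previous value; we own the lock if it was 0. */
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is ours. */
static inline void demo_lock(demo_spinlock *l)
{
    while (!demo_try_lock(l)) {
        /* Wait on a plain load so spinning threads don't keep issuing exchanges. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0) {
        }
    }
}

/* Release the lock; earlier writes become visible to the next owner. */
static inline void demo_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A lock declared as `demo_spinlock l = { 0 };` starts out free. Real code would normally use a mutex from the platform instead; the sketch only shows how an atomic exchange is used.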

On ARM, starting with v7-M and v7-A, there are load/store-exclusive instructions, which can be used to write load-compare-store loops. First you load, then you modify the value in a register, then you store; the store fails if somebody else stored to the same location between your load and your store. Note that load/store-exclusive instructions on their own don't introduce memory barriers.

On ARMv7, the instructions are LDREX and STREX. There are also LDREXD/STREXD for 64-bit atomic updates. On ARMv8, they are LDXR and STXR. There are also LDXP/STXP for 128-bit atomic updates.
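The load, modify, store-and-retry pattern described above can be sketched in portable C with a compare-exchange loop. The function name demo_fetch_max is hypothetical. The weak variant is allowed to fail spuriously, which corresponds to a store-exclusive failing. Assuming a typical compiler, the loop is usually lowered to an exclusive load/store loop on ARM and to LOCK CMPXCHG on x64. Relaxed ordering is used here to mirror the fact that the exclusive instructions add no barriers by themselves.

```c
#include <stdatomic.h>

/*
 * Atomically raise *target to at least `candidate`.
 * Returns the value that was observed before any update.
 */
int demo_fetch_max(atomic_int *target, int candidate)
{
    int observed = atomic_load_explicit(target, memory_order_relaxed);

    while (observed < candidate) {
        /* On failure, `observed` is refreshed with the current value and we re-check. */
        if (atomic_compare_exchange_weak_explicit(target, &observed, candidate,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
            break; /* our store went through */
        }
    }
    return observed;
}
```

For example, with a counter holding 5, `demo_fetch_max(&v, 3)` leaves it unchanged, while `demo_fetch_max(&v, 9)` updates it to 9; both calls return 5.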

Starting with ARMv8, there are load-acquire instructions with an A-suffix and store-release instructions with an L-suffix. There are both regular variants (e.g. LDAR) and exclusive variants (e.g. STLXR). These act as one-way barriers. Memory accesses after an acquire-barrier can't complete before the barrier. Memory accesses before a release-barrier must complete before the barrier.

Additionally, there is the processor-consistent acquire AP-suffix (e.g. LDAPR). I haven't understood what this is for yet.

AMD x64, Vol. 3, section 1.2.5; ARMv7-M, section A3.4; ARMv8-A, section B2.3.12; Load-Acquire and Store-Release instructions; Old New Thing on ARM barriers.

Brief ARM architecture overview

In general, the memory model rules apply to all ARM architecture versions. However, the newer architectures have more instructions to support multithreaded programming. My understanding is that on older architectures, pipelines were smaller and thus full barriers were less expensive relative to other instructions, so there was less of a need for more advanced synchronization mechanisms.

32-bit, microcontrollers:
  ARMv6-M     Cortex-M0 / M0+ / M1
  ARMv7-M     Cortex-M3 / M4 / M7
  ARMv8-M     Cortex-M33 / M35P / M55 / M85

32-bit, application:
  ARMv7-A     Cortex-A5 / A7 / A8 / A9 / A12 / A15 / A17

64-bit, application:
  ARMv8-A     Cortex-A32* / A34 / A35 / A53 / A57 / A72 / A73
  ARMv8.2-A   Cortex-A55 / A65 / A75 / A76 / A77 / A78
  ARMv9-A     Cortex-A510 / A710 / A715

*All ARMv8-A Cortex cores support 64-bit, except the Cortex-A32, which is 32-bit only.

The Raspberry Pi 2 is a Cortex-A7 (ARMv7-A), the Raspberry Pi 3 is a Cortex-A53 (ARMv8-A), and the Raspberry Pi 4 is a Cortex-A72 (ARMv8-A).

Wikipedia: Cortex-M and Cortex-A

What does the compiler do?

Volatile

In C and C++, the volatile keyword on a variable prevents the compiler from removing stores to or loads from that variable. It is commonly used to prevent the compiler from optimizing while (x) to while (1).

This only helps prevent compiler optimizations. It does not cause any barriers or special instructions to be generated. All the caveats of processor behaviour still apply.
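A small sketch of the while (x) case, with hypothetical names (stop_requested, spin_until_stopped, request_stop). It only shows what volatile does at the compiler level; as noted above, it adds no barriers, so it is not a substitute for atomics when threads on different cores communicate.

```c
#include <stddef.h>

/* Hypothetical flag that is set from somewhere the compiler cannot see. */
static volatile int stop_requested;

/* Counts loop iterations until the flag is set or max_iterations is reached. */
size_t spin_until_stopped(size_t max_iterations)
{
    size_t iterations = 0;

    /* Because the flag is volatile, it is re-read from memory on every pass. */
    while (!stop_requested && iterations < max_iterations) {
        iterations++;
    }
    return iterations;
}

/* Sets the flag; the store is not optimized away even if nothing visibly reads it. */
void request_stop(void)
{
    stop_requested = 1;
}
```

Without volatile, the compiler would be free to read the flag once before the loop and never look at it again.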

<stdatomic.h> in C11 and <atomic> in C++11

Using these libraries prevents problematic compiler optimizations, inserts the necessary barrier instructions, and generates special instructions as appropriate. You don't need to declare variables as volatile when using either atomic functions or types.

Generally, one doesn't have to worry about specific CPU details to write correct code when using these libraries, though knowing about CPU details can help write more performant code.

In C, there are atomic types, e.g. atomic_int. For these types, load-modify-store operators (e.g. ++ or +=) are atomic. But using these operators gives less control than using the functions, because the operators always use sequentially consistent ordering.

In C, there are functions such as atomic_fetch_add which take pointers to volatile objects. Since T * implicitly converts to volatile T *, you still don't actually need to declare variables as volatile to use these functions.
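The operator form and the function form can be compared side by side. This is a minimal sketch with a hypothetical counter named hits; note that the variable is not declared volatile, and that only the function form lets you pick a memory order.

```c
#include <stdatomic.h>

/* Hypothetical shared counter; not declared volatile. */
static atomic_int hits;

/* Operator form: an atomic read-modify-write with sequentially consistent ordering. */
int count_with_operator(void)
{
    return hits++; /* yields the value before the increment */
}

/* Function form: same operation, but with an explicitly chosen (weaker) ordering. */
int count_with_function(void)
{
    /* &hits is an atomic_int *, which converts implicitly to volatile atomic_int *. */
    return atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

/* Plain atomic load with the default (sequentially consistent) ordering. */
int read_hits(void)
{
    return atomic_load(&hits);
}
```

Relaxed ordering is enough here under the assumption that the counter is only a statistic and no other data is published through it.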

In C++, the atomic types are defined as having overloaded operators to provide the same operations as for the types in C. The atomic functions take std::atomic<T> * as a parameter, so you either need to declare your variables as such, or cast when calling atomic functions. This prevents accidentally performing a non-atomic operation on an atomic variable.

In both C and C++, the atomic functions allow specifying a memory order, e.g. memory_order_acquire. Acquire semantics prevent memory accesses which come later from being moved in front of the acquire access. Release semantics require that memory accesses which came earlier complete before the release access. See this blog post for more details.
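A common use of acquire and release is handing a piece of plain data from one thread to another. The sketch below uses hypothetical names (payload, ready, publish, try_consume). As an assumption about typical compilers, on ARMv8 the release store and acquire load usually become store-release and load-acquire instructions rather than separate barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;       /* plain, non-atomic data */
static atomic_bool ready; /* publication flag guarding payload */

/* Producer side: write the data first, then publish it. */
void publish(int value)
{
    payload = value;
    /* Release: the write to payload cannot move after this store. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: returns true and fills *out once the data has been published. */
bool try_consume(int *out)
{
    /* Acquire: the read of payload below cannot move before this load. */
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;
    return true;
}
```

If try_consume sees ready as true, it is guaranteed to also see the payload value written before the flag was set. The sketch publishes once; calling publish again while a consumer may be reading payload would be a data race.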

See cppreference (C++) and cppreference (C).

Windows

Windows provides Interlocked functions, for example InterlockedCompareExchange. Additionally, there are intrinsics, which are prefixed with an underscore, e.g. _InterlockedIncrement. In practice, both seem to produce identical code.

Linux kernel

The Linux kernel defines its own set of primitives and helpers. Here are some links. Use bootlin to find usage examples.