Atomic operations

What does the processor do?

Load/store reordering

In very broad strokes:

On ARM, loads and stores can be reordered arbitrarily.
On x64, the only allowed reordering is doing loads from later instructions before stores from earlier instructions complete.

Some examples:

Loads can be reordered to run before preceding stores. On both x64 and ARM, it is possible to get x=0 and y=0.
Thread 1
*a = 1; int x = *b;
Thread 2
*b = 1; int y = *a;
On x64, stores are not reordered with other stores, and loads are not reordered with other loads. In the following example, thread 2 can see x=0, y=0, or x=0, y=1, or x=1, y=1, but never x=1, y=0.

On ARM, it is also possible to get x=1, y=0. To rule this out, a DMB barrier is needed between the two instructions in each thread.
Thread 1
*a = 1; *b = 1;
Thread 2
int x = *b; int y = *a;
Within one thread, dependencies are tracked, so loads and stores to the same location can't be reordered relative to each other.
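In portable C11, the second example (write data, then set a flag) is usually expressed with a release store and an acquire load, leaving the choice of barrier instructions to the compiler. The following is a minimal sketch under that assumption; the names payload, ready and run_message_passing are made up for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;       /* plain data, published through `ready` */
static atomic_int ready;  /* flag, plays the role of *b above */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;  /* like *a = 1 */
    /* Release: the payload store above may not be moved after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    int *seen = arg;
    /* Acquire: the payload load below may not be moved before this load. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin until the flag becomes visible */
    *seen = payload;
    return NULL;
}

/* Runs one producer and one consumer, returns the payload the consumer saw. */
int run_message_passing(void) {
    pthread_t p, c;
    int seen = -1;
    payload = 0;
    atomic_store(&ready, 0);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Once the consumer has seen the flag, it is guaranteed to see the payload as well, which corresponds to forbidding the x=1, y=0 outcome.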

AMD x64, Vol. 2, section 7.2.; ARMv8-A, section B2.3.12; ARMv8-A, section K13; A Tutorial Introduction to the ARM and POWER Relaxed Memory Models.

Barriers

On x64, these barriers are available (but they are rarely needed, and I haven't found anybody posting an example where anything except MFENCE is needed):

LFENCE — Loads can't be reordered across the barrier.
SFENCE — Stores can't be reordered across the barrier.
MFENCE — Neither stores nor loads can be reordered across the barrier.
Instructions with the LOCK prefix also act as a full barrier, like MFENCE.

On ARM, these barriers are available. Typically only DMB is needed in user-space code.

DMB — Neither stores nor loads can be reordered across the barrier.
DSB — All stores and loads must complete before the barrier. Used so stores complete before WFI or WFE.
ISB — Flushes the pipeline. Needed when jumping to code dynamically placed in memory.
ARMv8-A introduces additional barriers related to speculation, tracing and profiling.
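From C11, a full barrier can be requested with atomic_thread_fence(memory_order_seq_cst). Compilers typically lower this to MFENCE or a LOCK-prefixed instruction on x64 and to DMB on ARM, though the exact instruction is up to the compiler. The sketch below applies it to the first example from the load/store reordering section, where a load may otherwise pass an earlier store; the function and variable names are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int a, b;
static int x, y;

static void *thread1(void *arg) {
    (void)arg;
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    x = atomic_load_explicit(&b, memory_order_relaxed);
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    y = atomic_load_explicit(&a, memory_order_relaxed);
    return NULL;
}

/* Runs both threads once. Returns 1 if x == 0 and y == 0 was observed,
 * which the two fences are meant to rule out. */
int store_buffering_trial(void) {
    pthread_t t1, t2;
    atomic_store(&a, 0);
    atomic_store(&b, 0);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return x == 0 && y == 0;
}
```

Without the fences, the relaxed store and load in each thread may be reordered, and the trial can return 1 on both x64 and ARM.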

AMD x64, Vol. 1, section 3.9.2.; ARM: Barrier Litmus Tests and Cookbook; Memory barriers are like source control operations; Who ordered memory fences on x86?.

Single copy atomicity

Single aligned stores are atomic on both x64 and ARM architectures.
On x64 with AVX, aligned 16-byte stores are atomic.
On x64, single stores within an 8-byte aligned region are atomic.
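In C11, this property is what makes relaxed atomic loads and stores cheap: for a naturally aligned 64-bit object, they typically compile to plain load and store instructions on x64 and 64-bit ARM. A small sketch, assuming a 64-bit target; shared, publish, snapshot and is_naturally_aligned are illustrative names.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* _Atomic makes the compiler pick an alignment suitable for atomic access. */
static _Atomic uint64_t shared;

/* A reader never observes a half-written value, but no ordering with
 * other memory accesses is implied (memory_order_relaxed). */
void publish(uint64_t v) {
    atomic_store_explicit(&shared, v, memory_order_relaxed);
}

uint64_t snapshot(void) {
    return atomic_load_explicit(&shared, memory_order_relaxed);
}

/* Nonzero if p is naturally aligned for an access of `size` bytes.
 * `size` must be a power of two. */
int is_naturally_aligned(const void *p, size_t size) {
    return ((uintptr_t)p & (size - 1)) == 0;
}
```

The alignment helper is only a debugging aid for plain (non-_Atomic) buffers, where nothing guarantees alignment.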

AMD x64, Vol. 2, section 7.3.2.; ARMv8-A, section B2.2.

Special CPU instructions

On x64, the LOCK prefix allows some load-modify-store instructions to be performed atomically. This includes ADD, SUB, INC, DEC, AND, OR, CMPXCHG, and CMPXCHG16B. The XCHG instruction implicitly has a LOCK prefix if it accesses memory.
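These instructions are normally reached through the C11 atomic functions rather than written by hand. On x64, GCC and Clang typically compile atomic_fetch_add to a LOCK ADD or LOCK XADD and atomic_compare_exchange_strong to a LOCK CMPXCHG. A sketch of both, with hypothetical names (counter, count_in_parallel, try_claim):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { THREADS = 4, ITERATIONS = 100000 };

static atomic_long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

/* Increments the shared counter from several threads, returns the total. */
long count_in_parallel(void) {
    pthread_t threads[THREADS];
    atomic_store(&counter, 0);
    for (int i = 0; i < THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&counter);
}

/* Sets *owner to id only if it is still 0. Returns true on success. */
bool try_claim(atomic_int *owner, int id) {
    int expected = 0;
    return atomic_compare_exchange_strong(owner, &expected, id);
}
```

With a plain long and counter++ instead, increments from different threads could be lost.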

On ARM, starting with v7-M and v7-A, there are store/load-exclusive instructions, which can be used to write load-compare-store-loops. First you load, then you modify in register, then you store, and the store fails if somebody else stored between your load and store. Note that store/load-exclusive instructions on their own don't introduce memory barriers.

On ARMv7, the instructions are LDREX and STREX. There also is LDREXD/STREXD for 64-bit register atomic updates. On ARMv8, they are LDXR and STXR. There also is LDXP/STXP for 128-bit atomic updates.
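Portable C cannot emit the exclusive instructions directly. The closest equivalent is a loop around atomic_compare_exchange_weak, which compilers typically expand to a load-exclusive/store-exclusive pair on ARM cores without newer single-instruction atomics. The weak variant is allowed to fail spuriously, which matches a store-exclusive that fails. The sketch below implements an atomic maximum; atomic_fetch_max_int is a made-up helper, not a standard function.

```c
#include <stdatomic.h>

/* Atomically sets *target to max(*target, value).
 * Returns the value that was stored before the call took effect. */
int atomic_fetch_max_int(atomic_int *target, int value) {
    /* "Load" step: read the current value. */
    int observed = atomic_load_explicit(target, memory_order_relaxed);

    /* "Store" step: only succeeds if nobody changed *target in between.
     * On failure, `observed` is refreshed with the current value and the
     * loop decides again whether a store is still needed. */
    while (observed < value &&
           !atomic_compare_exchange_weak_explicit(target, &observed, value,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* retry */
    }
    return observed;
}
```

All memory orders are relaxed here, mirroring the note that the exclusive instructions on their own do not act as barriers.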

Starting with ARMv8, there are load-acquire instructions with an A-suffix and store-release instructions with an L-suffix. There are both regular variants (e.g. LDAR) and exclusive variants (e.g. STLXR). These act as one-way barriers. Memory accesses after an acquire-barrier can't complete before the barrier. Memory accesses before a release-barrier must complete before the barrier.
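A spinlock is a typical use of one-way barriers: taking the lock only needs acquire semantics, and releasing it only needs release semantics. The following is a rough sketch in C11, not a production lock. On ARMv8, compilers typically implement the acquire side with an exclusive load-acquire loop and the release side with a store-release, but that mapping is an assumption about the compiler. All names are illustrative.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { THREADS = 4, ITERATIONS = 20000 };

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long protected_counter;  /* plain variable, guarded by `lock` */

static void spin_lock(void) {
    /* Acquire: accesses inside the critical section can't move above this. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* somebody else holds the lock, keep trying */
}

static void spin_unlock(void) {
    /* Release: accesses inside the critical section can't move below this. */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        spin_lock();
        protected_counter++;
        spin_unlock();
    }
    return NULL;
}

/* Increments a non-atomic counter from several threads under the lock. */
long count_under_spinlock(void) {
    pthread_t threads[THREADS];
    protected_counter = 0;
    for (int i = 0; i < THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(threads[i], NULL);
    return protected_counter;
}
```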

Additionally, there is the processor-consistent acquire AP-suffix (e.g. LDAPR). I haven't understood what this is for yet.

AMD x64, Vol. 3, section 1.2.5; ARMv7-M, section A3.4; ARMv8-A, section B2.3.12; Load-Acquire and Store-Release instructions; Old New Thing on ARM barriers.

Brief ARM architecture overview

In general, the memory model rules apply to all ARM architecture versions. However, the newer architectures have more instructions to support multithreaded programming. My understanding is that on older architectures, pipelines were smaller and thus full barriers were less expensive relative to other instructions, so there was less of a need for more advanced synchronization mechanisms.

32-bit, microcontrollers	ARMv6-M	Cortex-M0 / M0+ / M1
	ARMv7-M	Cortex-M3 / M4 / M7
	ARMv8-M	Cortex-M33 / M35P / M55 / M85
32-bit, application	ARMv7-A	Cortex-A5 / A7 / A8 / A9 / A12 / A15 / A17
64-bit, application	ARMv8-A	Cortex-A32* / A34 / A35 / A53 / A57 / A72 / A73
	ARMv8.2-A	Cortex-A55 / A65 / A75 / A76 / A77 / A78
	ARMv9-A	Cortex-A510 / A710 / A715

*All ARMv8-A Cortex cores support 64-bit, except the Cortex-A32, which is 32-bit only.

The Raspberry Pi 2 is a Cortex-A7 (ARMv7-A), the Raspberry Pi 3 is a Cortex-A53 (ARMv8-A), and the Raspberry Pi 4 is a Cortex-A72 (ARMv8-A).

Wikipedia: Cortex-M and Cortex-A

What does the compiler do?

Volatile

In C and C++, the volatile keyword on a variable prevents the compiler from removing stores or loads to that variable. It is commonly used to prevent the compiler from optimizing while (x) to while (1).

This only helps prevent compiler optimizations. It does not cause any barriers or special instructions to be generated. All the caveats of processor behaviour still apply.
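One case where volatile alone is enough is a flag set by a signal handler that interrupts the same thread, since no second processor is involved. A minimal sketch, assuming a POSIX-like system; stop_requested, on_sigint and wait_for_stop are illustrative names, and the raise() call only exists so that the example terminates on its own.

```c
#include <signal.h>

/* volatile forces the loop below to reload the flag on every iteration. */
static volatile sig_atomic_t stop_requested;

static void on_sigint(int signo) {
    (void)signo;
    stop_requested = 1;
}

/* Spins until the SIGINT handler has set the flag.
 * Returns the number of loop iterations performed. */
long wait_for_stop(void) {
    long spins = 0;
    stop_requested = 0;
    signal(SIGINT, on_sigint);
    while (!stop_requested) {
        if (++spins == 1000)
            raise(SIGINT);  /* simulate Ctrl-C after 1000 iterations */
    }
    return spins;
}
```

For a flag shared between threads, an atomic type is the safer choice, because volatile does not order the accesses around it.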

<stdatomic.h> in C11 and <atomic> in C++11

Using these libraries prevents problematic compiler optimizations, inserts the necessary barrier instructions, and generates special instructions as appropriate. You don't need to declare variables as volatile when using either atomic functions or types.

Generally, one doesn't have to worry about specific CPU details to write correct code when using these libraries, though knowing about CPU details can help write more performant code.

In C, there are atomic types, e.g. atomic_int. For these types, load-modify-store operators (e.g. ++ or +=) are atomic. But using these operators gives less control than using the functions.

In C, there are functions such as atomic_fetch_add, which take pointers to volatile objects. Since T * implicitly converts to volatile T *, you still don't actually need to declare variables as volatile to use these functions.
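The difference between operators and functions can be seen in a small sketch (hits and the two bump functions are made-up names). The operator form is always sequentially consistent and yields the new value, while the function form lets you pick a memory order and returns the old value.

```c
#include <stdatomic.h>

static atomic_int hits;

/* Operator form: atomic, always memory_order_seq_cst.
 * The expression evaluates to the value after the increment. */
int bump_with_operator(void) {
    return ++hits;
}

/* Function form: atomic, with an explicit (here relaxed) memory order.
 * atomic_fetch_add returns the value before the increment. */
int bump_with_function(void) {
    return atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}
```

Starting from zero, bump_with_operator() returns 1, and a following bump_with_function() also returns 1 while leaving hits at 2.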

In C++, the atomic types are defined as having overloaded operators to provide the same operations as for the types in C. The atomic functions take std::atomic<T> * as a parameter, so you either need to declare your variables as such, or cast when calling atomic functions. This prevents accidentally performing a non-atomic operation on an atomic variable.

Both in C and C++, the atomic functions allow specifying a memory order, e.g. memory_order_acquire. Acquire-semantics prevent memory accesses which come later from being moved in front of the acquire-access. Release-semantics require that memory accesses which came earlier complete before the release-access. See this blog post for more details.
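Reference counting is a common place where choosing the memory order matters. Taking a reference needs no ordering, but dropping the last one must make sure that all earlier uses of the object are finished before it is freed. The following is a sketch of that widely used pattern in C; struct object and its functions are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct object {
    atomic_int refs;
    int value;
};

/* Allocates an object with one reference. Returns NULL on failure. */
struct object *object_new(int value) {
    struct object *o = malloc(sizeof *o);
    if (o) {
        atomic_init(&o->refs, 1);
        o->value = value;
    }
    return o;
}

/* Taking another reference doesn't need to order anything. */
void object_retain(struct object *o) {
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/* Drops one reference. Returns true if the object was freed. */
bool object_release(struct object *o) {
    /* Release: our earlier accesses to *o complete before the decrement. */
    if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_release) != 1)
        return false;
    /* Acquire: the free below can't start before we have seen the
     * accesses of every thread that dropped its reference earlier. */
    atomic_thread_fence(memory_order_acquire);
    free(o);
    return true;
}
```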

See cppreference (C++) and cppreference (C).

Windows

Windows provides Interlocked functions, for example InterlockedCompareExchange. Additionally, there are intrinsics, which are prefixed with an underscore, e.g. _InterlockedIncrement. In practice, both seem to produce identical code.

Linux kernel

The Linux kernel defines its own set of primitives and helpers. Here are some links. Use bootlin to find usage examples.

docs.kernel.org: Atomic types, provides functions similar, but not identical, to <stdatomic.h>.
lwn.net: Atomic primitives in the kernel, which also provides some history.
docs.kernel.org: RCU Concepts, which provides protection for mostly-read variables. Also see lwn.net: What is RCU, Fundamentally?
kernel.org: Memory barriers and lwn.net: Lockless patterns: full memory barriers.

Atomic operations cheatsheet