Loads can be reordered to run before preceding stores. In the example below, on both x64 and ARM it is possible to get x=0 and y=0.
| Thread 1 | Thread 2 |
| --- | --- |
| *a = 1; int x = *b; | *b = 1; int y = *a; |
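As a concrete illustration, here is a minimal C11 sketch of that example (the pthread harness and names are my own). With the default sequentially consistent atomics shown here, the x == 0 && y == 0 outcome is forbidden; with plain non-atomic accesses it can occur on both x64 and ARM.

```c
// Minimal sketch of the store/load example using C11 atomics and pthreads.
// With seq_cst atomics, x == 0 && y == 0 cannot be observed.
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int a, b;
static int x, y;

static void *thread1(void *arg) {
    (void)arg;
    atomic_store(&a, 1);     // *a = 1;
    x = atomic_load(&b);     // int x = *b;
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    atomic_store(&b, 1);     // *b = 1;
    y = atomic_load(&a);     // int y = *a;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%d y=%d\n", x, y);
    return 0;
}
```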
On x64, stores are not re-ordered with other stores, and loads are not re-ordered with other loads. In the example below, thread 2 can therefore see x=0, y=0; x=0, y=1; or x=1, y=1, but never x=1, y=0: if it observes *b == 1, the earlier store to *a must be visible as well.
On ARM, x=1, y=0 is also possible, because the stores or the loads may be re-ordered. A DMB barrier between the instruction pairs in both threads rules it out (a C11 sketch of the equivalent fences follows the table).
| Thread 1 | Thread 2 |
| --- | --- |
| *a = 1; *b = 1; | int x = *b; int y = *a; |
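Here is a sketch of where those barriers go, expressed with C11 fences (function names are my own). On ARMv7 a compiler typically emits a DMB for each of these fences, and the x == 1 && y == 0 outcome is then forbidden.

```c
// Sketch: placing the barriers for the second example with C11 fences.
// On ARMv7 each fence typically compiles to a DMB instruction.
#include <stdatomic.h>

atomic_int a, b;

void writer(void) {                               /* Thread 1 */
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);    /* keep the two stores in order */
    atomic_store_explicit(&b, 1, memory_order_relaxed);
}

void reader(int *x, int *y) {                     /* Thread 2 */
    *x = atomic_load_explicit(&b, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);    /* keep the two loads in order */
    *y = atomic_load_explicit(&a, memory_order_relaxed);
}
```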
On x64, the LOCK prefix allows some load-modify-store instructions to be performed atomically. This includes ADD, SUB, INC, DEC, AND, OR, CMPXCHG, and CMPXCHG16B. The XCHG instruction implicitly behaves as if it had a LOCK prefix whenever it accesses memory.
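For reference, a small sketch (my own example) of how the C11 atomics discussed further below typically map to these instructions on x64; the exact instruction choice is up to the compiler and not guaranteed.

```c
// Sketch: typical x64 code generation for C11 atomic load-modify-store
// operations (compiler-dependent).
#include <stdatomic.h>

atomic_int counter;

int bump(void) {
    // Usually compiles to LOCK XADD (or LOCK ADD if the result is unused).
    return atomic_fetch_add(&counter, 1);
}

_Bool claim(atomic_int *state) {
    int expected = 0;
    // Usually compiles to LOCK CMPXCHG.
    return atomic_compare_exchange_strong(state, &expected, 1);
}
```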
On ARM, starting with v7-M and v7-A, there are load-exclusive/store-exclusive instructions, which can be used to write load-modify-store loops. First you load, then you modify in a register, then you store, and the store fails if somebody else wrote to the location between your load and store. Note that the exclusive instructions on their own don't introduce memory barriers.
On ARMv7, the instructions are LDREX and STREX, with LDREXD/STREXD for 64-bit atomic updates. On ARMv8, they are LDXR and STXR, with LDXP/STXP for 128-bit atomic updates.
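A sketch of such a retry loop using C11 atomics (the function name atomic_store_max is mine; C11 has no atomic_fetch_max, so an atomic maximum is a natural thing to hand-roll). On ARMv7 a compiler typically lowers the compare-exchange into an LDREX/STREX loop, and the weak variant is allowed to fail spuriously just as STREX can.

```c
// Sketch: an atomic "store the maximum" built from a load / modify /
// store-conditional style retry loop (function name is hypothetical).
// On ARMv7 the compare-exchange typically becomes an LDREX/STREX loop.
#include <stdatomic.h>

void atomic_store_max(atomic_int *target, int value) {
    int observed = atomic_load(target);             // load
    while (observed < value &&                      // modify in a register
           !atomic_compare_exchange_weak(target, &observed, value)) {
        // The store failed because someone else wrote in between (or the
        // weak exchange failed spuriously); 'observed' now holds the
        // freshly loaded value, so just retry.
    }
}
```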
Starting with ARMv8, there are load-acquire instructions with an A suffix and store-release instructions with an L suffix. There are both regular variants (e.g. LDAR) and exclusive variants (e.g. STLXR). These act as one-way barriers: memory accesses after an acquire can't be moved to before it, and memory accesses before a release must complete before it.
Additionally, there is a processor-consistent acquire variant with an AP suffix (e.g. LDAPR). I haven't understood what this is for yet.
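To make the acquire/release pairing concrete, here is a minimal C11 message-passing sketch (my own example; function names are hypothetical). On ARMv8 the release store typically compiles to STLR and the acquire load to LDAR.

```c
// Sketch: publishing data with a release store and consuming it with an
// acquire load. On ARMv8 these typically compile to STLR and LDAR.
#include <stdatomic.h>
#include <stdbool.h>

int payload;                 // plain data, written before publishing
atomic_bool ready;

void publish(void) {
    payload = 42;            // happens before the release store
    atomic_store_explicit(&ready, true, memory_order_release);
}

bool consume(int *out) {
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;      // guaranteed to see 42 after the acquire
        return true;
    }
    return false;
}
```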
References: AMD x64, Vol. 3, section 1.2.5; ARMv7-M, section A3.4; ARMv8-A, section B2.3.12; Load-Acquire and Store-Release instructions; Old New Thing on ARM barriers.

In general, the memory model rules apply to all ARM architecture versions. However, the newer architectures have more instructions to support multithreaded programming. My understanding is that on older architectures, pipelines were shorter and thus full barriers were less expensive relative to other instructions, so there was less need for more advanced synchronization mechanisms.
| Profile | Architecture | Cores |
| --- | --- | --- |
| 32-bit, microcontrollers | ARMv6-M | Cortex-M0 / M0+ / M1 |
| | ARMv7-M | Cortex-M3 / M4 / M7 |
| | ARMv8-M | Cortex-M33 / M35P / M55 / M85 |
| 32-bit, application | ARMv7-A | Cortex-A5 / A7 / A8 / A9 / A12 / A15 / A17 |
| 64-bit, application | ARMv8-A | Cortex-A32* / A34 / A35 / A53 / A57 / A72 / A73 |
| | ARMv8.2-A | Cortex-A55 / A65 / A75 / A76 / A77 / A78 |
| | ARMv9-A | Cortex-A510 / A710 / A715 |
*All ARMv8-A Cortex cores support 64-bit, except the Cortex-A32, which is 32-bit only.
The Raspberry Pi 2 is a Cortex-A7 (ARMv7-A), the Raspberry Pi 3 is a Cortex-A53 (ARMv8-A), and the Raspberry Pi 4 is a Cortex-A72 (ARMv8-A).
Wikipedia: Cortex-M and Cortex-A.

In C and C++, the volatile keyword on a variable prevents the compiler from removing stores or loads to that variable. It is commonly used to prevent the compiler from optimizing while (x) to while (1).
This only helps prevent compiler optimizations. It does not cause any barriers or special instructions to be generated. All the caveats of processor behaviour still apply.
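A small sketch (my own example) of the typical use: a flag set from a signal handler and polled in a loop. Without volatile the compiler may keep the flag in a register and never re-read it; with volatile the loop re-reads it, but no barriers or atomicity are added.

```c
// Sketch: 'volatile' keeps the compiler from hoisting the load of 'stop'
// out of the loop. It does not add barriers or make the access atomic.
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t stop = 0;

static void on_sigint(int sig) {
    (void)sig;
    stop = 1;
}

int main(void) {
    signal(SIGINT, on_sigint);
    while (!stop) {          // without volatile this could become while (1)
        pause();             // wait for a signal
    }
    return 0;
}
```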
Using these libraries both prevents the problematic compiler optimizations and inserts the necessary barrier or special instructions as appropriate. You don't need to declare variables as volatile when using either atomic functions or atomic types.
Generally, one doesn't have to worry about specific CPU details to write correct code when using these libraries, though knowing about CPU details can help write more performant code.
In C, there are atomic types, e.g. atomic_int. For these types, load-modify-store operators (e.g. ++ or +=) are atomic. But using these operators gives less control than using the functions.
In C, there are functions such as atomic_fetch_add which take pointers to volatile atomic objects. Since T * implicitly converts to volatile T *, you still don't need to declare variables as volatile to use these functions.
In C++, the atomic types have overloaded operators that provide the same operations as the C types. The atomic functions take a std::atomic&lt;T&gt; * parameter, so you either need to declare your variables with that type or cast when calling the functions. Declaring them as std::atomic&lt;T&gt; also prevents accidentally performing a non-atomic operation on an atomic variable.
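A small C sketch (my own example) of both styles: the overloaded operator on an atomic type and the corresponding function call with an explicit memory order. The same pattern works in C++ with std::atomic&lt;int&gt; and the std:: function names.

```c
// Sketch: atomic operators vs. atomic functions in C11. Note that no
// 'volatile' qualifier is needed anywhere.
#include <stdatomic.h>
#include <stdio.h>

atomic_int hits;

void count_hit(void) {
    hits++;                          // atomic load-modify-store, seq_cst
}

void count_hit_relaxed(void) {
    // Same operation, but the function form lets you pick a memory order.
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

int main(void) {
    count_hit();
    count_hit_relaxed();
    printf("%d\n", atomic_load(&hits));   // prints 2
    return 0;
}
```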
Both in C and C++, the atomic functions allow specifying a memory order, e.g. memory_order_acquire. Acquire semantics prevent memory accesses that come later from being moved before the acquire access. Release semantics require that memory accesses that came earlier complete before the release access. See this blog post for more details.
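As a usage sketch of explicit memory orders (my own example), here is a minimal spinlock: taking the lock is an acquire operation and releasing it is a release operation, so nothing inside the critical section can leak out in either direction.

```c
// Sketch: a minimal spinlock built on atomic_flag with explicit memory
// orders. Acquire on lock and release on unlock fence the critical section.
#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock(void) {
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire)) {
        // spin until the flag was previously clear
    }
}

void unlock(void) {
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}
```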
See cppreference (C++) and cppreference (C).
Windows provides Interlocked functions. For example InterlockedCompareExchange.
Additionally, there are intrinsics, which are prefixed with an underscore, e.g. _InterlockedCompareExchange.
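A sketch of typical usage of the Interlocked functions mentioned above (my own, Windows-only example): InterlockedCompareExchange returns the previous value, so exactly one thread can win a one-time initialization.

```c
// Sketch: InterlockedCompareExchange returns the previous value of
// *Destination, so exactly one caller observes 0 and wins the race.
// (Losers return immediately; a real once-init would also wait for the
// winner to finish.)
#include <windows.h>

static volatile LONG init_state = 0;   /* 0 = not initialized, 1 = claimed */

void init_once(void) {
    if (InterlockedCompareExchange(&init_state, 1, 0) == 0) {
        /* first caller: perform the one-time initialization here */
    }
}
```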
The Linux kernel defines its own set of primitives and helpers. Here are some links. Use bootlin to find usage examples.
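A sketch (my own, untested, kernel-style) of a few of those primitives: atomic_t counters, WRITE_ONCE, and the smp_store_release/smp_load_acquire pair, which mirror the release/acquire orderings discussed above. This is only a minimal module skeleton, intended to be built with kbuild.

```c
// Sketch (untested): a few Linux kernel primitives in a minimal module.
#include <linux/module.h>
#include <linux/atomic.h>
#include <linux/init.h>

static atomic_t events = ATOMIC_INIT(0);
static int payload;
static int ready;

static int __init demo_init(void)
{
	atomic_inc(&events);                 /* atomic increment */

	WRITE_ONCE(payload, 42);             /* plain store, but not torn or elided */
	smp_store_release(&ready, 1);        /* publish payload */

	if (smp_load_acquire(&ready))        /* consume payload */
		pr_info("demo: payload=%d events=%d\n",
			payload, atomic_read(&events));
	return 0;
}

static void __exit demo_exit(void)
{
	pr_info("demo: exiting after %d events\n", atomic_read(&events));
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```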