History
Modern CPUs employ cache coherency protocols such as MESI to synchronize data across private L1 caches of different cores. When independent threads write to distinct memory locations that accidentally reside on the same cache line (typically 64 or 128 bytes), the hardware serializes these operations by continuously invalidating and transferring ownership of that line, a phenomenon termed false sharing. C++17 introduced std::hardware_destructive_interference_size to expose the architecture’s cache line width, allowing developers to separate mutable data so that each thread’s hot variables occupy distinct lines and avoid this synchronization overhead.
Problem
Applying alignas(std::hardware_destructive_interference_size) to a variable with automatic storage duration ensures only that the object's starting address is a multiple of the cache line size within its thread's stack frame. It says nothing about the bytes that follow: if the object is smaller than the cache line, the compiler is free to place other automatic variables in the remainder of that same line. When a pointer to such a stack variable is shared with another thread, writes to those neighboring variables still force the hardware to invalidate and transfer the shared line, so coherency traffic persists and the alignas specification alone is insufficient for isolation.
Solution
To guarantee avoidance of false sharing, the data must be padded to consume the entirety of the cache line, ensuring no other data shares the physical storage regardless of runtime address layout. This is accomplished by defining a struct that is both aligned to and sized according to std::hardware_destructive_interference_size.
#include <new>     // std::hardware_destructive_interference_size
#include <atomic>

struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
    std::atomic<int> value;
    // Padding fills the remainder of the cache line to prevent sharing.
    // (Because alignas on the type already rounds sizeof up to the
    // alignment, this array makes the intent explicit rather than being
    // strictly required.)
    char padding[std::hardware_destructive_interference_size - sizeof(std::atomic<int>)];
};

// Array guarantees each element resides on a distinct cache line
PaddedCounter thread_counters[8];
Problem description
A low-latency market-data processor utilized eight worker threads, each maintaining a per-thread tick counter in a global array of std::atomic<int> stats[8]. Each thread exclusively incremented its own index without locks, yet profiling revealed that throughput plateaued at a fraction of the theoretical maximum, with CPU counters showing excessive cache coherency cycles rather than user-mode computation. Investigation confirmed that the atomic integers, despite being logically independent, were packed contiguously into a single 64-byte cache line, causing destructive interference between cores.
Solution 1: Local aligned variables
The team initially attempted to declare alignas(64) std::atomic<int> local_stat inside each thread's execution function, passing pointers to a monitoring thread. This approach required minimal refactoring and avoided global state. However, it proved unreliable: alignas aligns only the start of the variable, so the compiler could place other hot automatic variables in the remaining bytes of the same cache line. The owning thread's writes to those neighbors then invalidated the line the monitoring thread was reading, perpetuating the false sharing.
Solution 2: Heap allocation with raw pointers
A second approach allocated each counter via new std::atomic<int>, hoping the heap allocator would scatter allocations across distant memory addresses. While this sometimes reduced contention, it introduced nondeterministic performance: small allocations are often served from contiguous slabs, and allocator metadata can place distinct objects on the same cache line. It also required manual memory management and provided no compile-time guarantee of alignment or padding.
Chosen solution and result
The final implementation adopted the PaddedCounter struct defined above, storing instances in a static array. This solution was selected because it deterministically enforced cache-line separation through compile-time padding and alignment, eliminating hardware-level contention regardless of runtime memory layout. Memory consumption increased from 32 bytes to 512 bytes, which was acceptable for the performance gain. The result was a twelve-fold increase in throughput and a reduction in latency variance, meeting the sub-microsecond processing requirements.
Why does applying alignas(std::hardware_destructive_interference_size) to a small object fail to prevent false sharing with other data on the same thread?
alignas only controls the alignment of the object's starting address, not its extent. If the object is smaller than the cache line (e.g., a 4-byte integer on a 64-byte line), the remaining bytes of that line can hold other variables, and the compiler routinely places neighboring data there. Writes to those neighbors then invalidate the line and false sharing occurs even though the counter itself is aligned. True isolation requires the object to occupy the entire line, via explicit padding or an alignas-qualified type whose sizeof is rounded up to its alignment, not merely to be aligned to its start.
What is the distinction between std::hardware_destructive_interference_size and std::hardware_constructive_interference_size, and when would grouping data to fit within the latter improve performance?
std::hardware_destructive_interference_size is the minimum separation required to avoid false sharing, while std::hardware_constructive_interference_size is the maximum size of data that benefits from spatial locality on a single cache line. Grouping related frequently-accessed fields (e.g., a point’s x, y, z coordinates) into a struct that fits within the constructive size ensures they reside on the same line, maximizing cache hit rates and prefetching efficiency, whereas destructive size is used to separate unrelated mutable data.
How does false sharing impact std::atomic operations using memory_order_relaxed, and why does the relaxed memory ordering not resolve the performance degradation?
Even with memory_order_relaxed, which imposes no ordering constraints on surrounding memory operations, an atomic write still requires the CPU core to acquire exclusive ownership of the cache line (a Read-For-Ownership cycle). If another thread recently modified a different variable on that same line, the cache coherency protocol forces the line to bounce between cores. This hardware-level synchronization occurs independently of the C++ memory model’s logical guarantees, meaning false sharing incurs full cache-miss latency regardless of the specified memory ordering.