Java Programming: Senior Java Developer

At what threshold of CAS contention does **LongAdder** instantiate its striped cell array, and how does this spatial partitioning mitigate cache coherency traffic?


Answer to the question

History: Prior to Java 8, concurrent accumulation relied on AtomicLong, whose single memory location became a scalability bottleneck under thread contention due to excessive cache line invalidation across CPU cores. Java 8 introduced LongAdder in the java.util.concurrent.atomic package to address this; it extends the internal Striped64 base class, which dynamically partitions write operations across multiple padded cells.

Problem: When numerous threads simultaneously attempt CAS operations on a shared AtomicLong, every update forces the counter's cache line to migrate between cores, and each failed CAS must be retried; the resulting coherence traffic serializes memory access and degrades throughput sharply as core count grows. This phenomenon, known as cache line bouncing, prevents linear scalability even on otherwise embarrassingly parallel tasks.

Solution: LongAdder initially attempts updates on a single base field using CAS; only upon detecting contention, i.e. when a CAS on base fails, does it lazily allocate an array of Cell objects (starting at two cells and doubling on repeated collisions, bounded by the number of CPUs). Each Cell's value is padded via @Contended, and each thread thereafter hashes to a cell using a thread-local probe value, performing mostly uncontended additions on isolated cache lines, while the sum() method aggregates base and all cells only when a total is requested.
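A minimal sketch contrasting the two counters under multithreaded increments (class name, thread count, and iteration count are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

class CounterDemo {
    static final int THREADS = 8;
    static final int INCREMENTS = 100_000;

    // Runs the same workload against both counters; returns {atomicTotal, adderTotal}.
    static long[] run() throws InterruptedException {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            pool.execute(() -> {
                for (int i = 0; i < INCREMENTS; i++) {
                    atomic.incrementAndGet(); // every thread CASes the same memory word
                    adder.increment();        // contended threads hash to separate padded cells
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return new long[] { atomic.get(), adder.sum() };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] totals = run();
        System.out.println("AtomicLong: " + totals[0] + ", LongAdder: " + totals[1]);
    }
}
```

Both totals are exact once all writers have finished; the difference under contention is throughput, not correctness.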

Real-world scenario

A high-frequency trading platform required a global counter to validate order throughput across a 64-core deployment, initially implemented using AtomicLong. During market volatility spikes, the system exhibited nonlinear latency degradation: the 99th-percentile response time increased tenfold, and profiling revealed that roughly 40% of CPU cycles were spent on cache coherence traffic contending for the counter's single memory address.

The engineering team considered three architectural solutions.

1. A manual thread-local counter map, where each thread maintained an independent AtomicLong in a ConcurrentHashMap, periodically aggregated by a background reporter. This eliminated contention but introduced significant per-thread memory overhead and complex lifecycle management during thread pool resizing, risking memory leaks in long-running executors.
2. A custom sharding strategy using an array of 64 AtomicLong instances indexed by Thread.currentThread().getId() % 64. This reduced cache traffic but suffered from uneven distribution when thread pools reused IDs and required manual handling of array resizing during traffic growth, adding brittle maintenance burden.
3. Migrating to LongAdder, which offered built-in dynamic striping with automatic @Contended padding to prevent false sharing, at the cost of read operations returning weakly consistent approximations rather than exact atomic snapshots.
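The hand-rolled sharding option might look like the following sketch (class name and shard count are hypothetical). Note that, unlike LongAdder's cells, adjacent AtomicLong instances here carry no padding, so they may still land on the same cache line and exhibit false sharing:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the hand-rolled sharding the team prototyped.
class ShardedCounter {
    private final AtomicLong[] shards;

    ShardedCounter(int shardCount) {
        shards = new AtomicLong[shardCount];
        for (int i = 0; i < shardCount; i++) shards[i] = new AtomicLong();
    }

    void increment() {
        // Thread IDs are reused by pools, so this mapping can cluster
        // many threads onto one shard and recreate the hotspot.
        int index = (int) (Thread.currentThread().getId() % shards.length);
        shards[index].incrementAndGet();
    }

    long sum() {
        long total = 0;
        // Weakly consistent, just like LongAdder.sum(): shards may be
        // updated concurrently while this loop runs.
        for (AtomicLong shard : shards) total += shard.get();
        return total;
    }
}
```

The design also offers no resizing heuristic, which is exactly the maintenance burden the team flagged.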

The team ultimately selected LongAdder because the business requirement tolerated slightly stale read values for monitoring dashboards, while the write-heavy validation path demanded maximum throughput. The automatic cell expansion heuristic ensured that during low-traffic periods the object remained lightweight (single base field), while high contention triggered transparent scaling across padded cells. Post-deployment latency stabilized, with throughput scaling linearly up to 64 cores as cache invalidation traffic distributed across distinct memory regions rather than concentrating on a single hotspot.

What candidates often miss

Question: Why does frequent polling of LongAdder.sum() in a tight loop potentially negate the performance benefits of striping, and what consistency guarantees does this method provide?

Answer: The sum() method must read the base field and every active Cell to compute a total; each volatile read pulls that cell's cache line away from the core that writes it, so a tight read loop continuously steals cache lines back from the writers and reintroduces much of the coherence traffic LongAdder was designed to avoid. Furthermore, sum() offers only weak consistency: it is not an atomic snapshot relative to concurrent updates, so the result may represent a transient state in which some threads' increments are visible while others are not.
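A small sketch of the weak-consistency behaviour: a sum() taken while a writer is still running can be any value between zero and the final total; only the post-join read is exact. Class and method names are illustrative.

```java
import java.util.concurrent.atomic.LongAdder;

class WeakSumDemo {
    // Returns {inFlightSnapshot, settledSum}; the first element is only
    // a weakly consistent estimate taken while the writer is still running.
    static long[] run() throws InterruptedException {
        LongAdder adder = new LongAdder();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) adder.increment();
        });
        writer.start();
        long inFlight = adder.sum(); // not an atomic snapshot: may miss in-flight increments
        writer.join();
        long settled = adder.sum();  // exact once all writers are quiescent
        return new long[] { inFlight, settled };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] r = run();
        System.out.println("in flight: " + r[0] + ", settled: " + r[1]);
    }
}
```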

Question: How does the @Contended annotation within LongAdder's internal Cell class prevent false sharing, and what JVM flag governs this padding behavior?

Answer: @Contended instructs the HotSpot JVM's object-layout machinery to insert 128 bytes of padding (tunable via -XX:ContendedPaddingWidth) around the annotated field or class, ensuring that adjacent Cell instances reside on distinct cache lines regardless of field-packing optimizations. Without this padding, neighbouring cells could share a 64-byte cache line, so writes to one cell would invalidate cached copies of its neighbours on other cores and reintroduce cache line bouncing. Candidates frequently overlook that the annotation is honoured only for JDK-internal classes by default; user code must run with -XX:-RestrictContended for it to take effect.
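Since @Contended is ignored in application code without -XX:-RestrictContended, its effect can be approximated with manual padding; a hedged sketch (field names are illustrative, padding sized for a 64-byte cache line):

```java
// Manual padding that approximates what @Contended arranges for LongAdder's
// cells. Seven longs (56 bytes) on either side of the hot field push
// neighbouring instances onto different 64-byte cache lines, at the cost
// of a much larger per-object footprint.
class PaddedCounter {
    volatile long p1, p2, p3, p4, p5, p6, p7; // leading pad
    volatile long value;                      // the hot field
    volatile long q1, q2, q3, q4, q5, q6, q7; // trailing pad

    // Note: ++ on a volatile is not atomic, so this sketch assumes a
    // single writer per instance; LongAdder's real cells CAS their values.
    void increment() { value++; }

    long get() { return value; }
}
```

Object headers and field reordering mean the exact layout is JVM-dependent, which is precisely why the JDK prefers the annotation over hand-rolled padding.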

Question: Under what specific circumstances would LongAdder exhibit worse performance than AtomicLong, and how does the longValue() implementation influence this hazard?

Answer: In the uncontended, single-threaded case LongAdder performs a plain CAS on its base field, much like AtomicLong, but it carries a larger object footprint, offers no exact atomic read-modify-write operations (no compareAndSet, no getAndIncrement returning a precise value), and once cells have been inflated every read must traverse them; this makes AtomicLong the better fit for low-contention counters or values that must be read exactly. Additionally, longValue() delegates directly to sum(), so any code path that continuously checks the counter's value, such as a spin loop or backpressure algorithm, forces repeated aggregation across all cells and pulls their cache lines away from the writers, largely forfeiting the scalability the striping bought.
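A sketch of the hazard (class, method names, and the limit parameter are hypothetical): a guard that polls the adder on every admission check pays a full cell traversal each time, whereas reading an AtomicLong is a single volatile load.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

class ReadPathChoice {
    // Polling sum() on every admission check forces a traversal of base
    // plus all inflated cells per call (longValue() delegates here too).
    static boolean admitWithAdder(LongAdder inFlight, long limit) {
        return inFlight.sum() < limit;
    }

    // For a hot check-then-act path, a single AtomicLong can be the better
    // fit despite write contention: get() is one volatile read, and the
    // value it returns is exact at that instant.
    static boolean admitWithAtomic(AtomicLong inFlight, long limit) {
        return inFlight.get() < limit;
    }
}
```

The rule of thumb the question is probing for: LongAdder for write-heavy, read-rarely statistics; AtomicLong when reads are frequent or must be exact.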