Via what specific optimization does **G1** transparently consolidate duplicate **String** backing arrays during routine garbage collection cycles without extending stop-the-world durations?

Answer to the question.

History of the question

Prior to Java 8 update 20, developers seeking to reduce heap consumption from duplicate String instances relied mainly on String.intern() or hand-rolled caches. Interning stores strings in a JVM-wide string table, which lived in the permanent generation up to Java 6 and has lived in the main heap since Java 7 (it never moved to Metaspace). It requires explicit API calls at every call site and can put pressure on the string table itself. With JEP 192, delivered in Java 8 update 20, the G1 garbage collector introduced automatic String Deduplication, a transparent optimization targeting the ubiquitous problem of redundant character arrays in enterprise applications.

The problem

In data-intensive Java applications, such as those parsing XML, JSON, or database result sets, String objects often make up a quarter or more of the live heap; the measurements cited in JEP 192 put the figure at roughly 25%, with about half of those strings being duplicates. These duplicates are character-for-character identical but reside in distinct backing arrays (char[] up to Java 8, byte[] from Java 9 onward). Without intervention, the duplicate arrays waste memory and increase GC frequency. The challenge was to eliminate this redundancy without introducing additional stop-the-world pauses or requiring code modifications.

The solution

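G1 splits the work between its existing evacuation pause and a concurrent background thread. When enabled via -XX:+UseStringDeduplication, the collector checks each String it evacuates from the young generation while application threads are already stopped. A String that has reached -XX:StringDeduplicationAgeThreshold collections (default 3), or that is promoted to the old generation before reaching that age, is placed on a deduplication queue. This check-and-enqueue step is the only work added to the pause. A dedicated deduplication thread then drains the queue concurrently with the application: it computes a hash of each backing array and consults a deduplication table. If an identical array is already in the table, the thread redirects the String's value field to the existing array, leaving the duplicate unreachable so that a later collection can reclaim it; otherwise the array is added to the table. Because hashing and lookup happen outside the pause, stop-the-world time is essentially unchanged, and the cost shows up as marginal background CPU overhead.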

    // No code changes required; JVM flags enable the optimization:
    // -XX:+UseG1GC -XX:+UseStringDeduplication -XX:StringDeduplicationAgeThreshold=3
    public class DeduplicationExample {
        public static void main(String[] args) {
            // These two strings share the same backing array after deduplication
            String a = new String("FinancialInstrument".toCharArray());
            String b = new String("FinancialInstrument".toCharArray());
            // After sufficient GC cycles and evacuation pauses,
            // a.value == b.value (internal array reference equality)
        }
    }
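The queue-and-table mechanism described above can be pictured with a small toy model. This is only a sketch: HotSpot implements deduplication in native code, and the names DedupModel, FakeString, enqueue, and drain are hypothetical stand-ins invented for illustration. The model is single-threaded, whereas the real background thread runs concurrently with the application.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model only; none of these classes exist in the JDK. */
class DedupModel {
    /** Stand-in for java.lang.String: a holder with a replaceable backing array. */
    static final class FakeString {
        byte[] value;
        FakeString(String text) {
            this.value = text.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Content-based key, so arrays with equal bytes map to the same table entry. */
    private static final class Key {
        final byte[] bytes;
        Key(byte[] bytes) { this.bytes = bytes; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(bytes, ((Key) o).bytes);
        }
        @Override public int hashCode() { return Arrays.hashCode(bytes); }
    }

    private final Queue<FakeString> candidates = new ArrayDeque<>();
    private final Map<Key, byte[]> table = new HashMap<>();

    /** Models the pause-time step: the collector only enqueues the candidate. */
    void enqueue(FakeString s) {
        candidates.add(s);
    }

    /** Models the background step; returns how many strings were redirected. */
    int drain() {
        int redirected = 0;
        FakeString s;
        while ((s = candidates.poll()) != null) {
            // Returns the array already in the table, or null if this one was just added.
            byte[] canonical = table.putIfAbsent(new Key(s.value), s.value);
            if (canonical != null && canonical != s.value) {
                s.value = canonical; // the duplicate array becomes unreachable
                redirected++;
            }
        }
        return redirected;
    }

    public static void main(String[] args) {
        DedupModel model = new DedupModel();
        FakeString a = new FakeString("FinancialInstrument");
        FakeString b = new FakeString("FinancialInstrument");
        model.enqueue(a);
        model.enqueue(b);
        int redirected = model.drain();
        System.out.println(redirected + " redirected, shared=" + (a.value == b.value));
        // prints "1 redirected, shared=true"
    }
}
```

The point of the split is visible in the model: enqueue is constant-time work, while the hashing and comparison cost sits entirely in drain.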

Situation from life

A high-frequency trading platform processing FIX protocol messages experienced severe G1 pause times exceeding 200ms. Profiling revealed that 30% of the 64GB heap was consumed by String objects representing standard tags (e.g., "55", "150", "EUR/USD") and enum-like values parsed from incoming byte streams. Each message instantiation created new String instances via new String(byte[], Charset), resulting in millions of duplicate backing arrays per minute.

Several solutions were evaluated. String.intern() was rejected because it required invasive changes across 50+ message types and risked bloating the JVM-wide string table with every distinct value seen on the wire. (Interned strings live on the heap, not in Metaspace, and can be collected once unreachable, but the table still has to be sized and tuned.) A custom WeakHashMap-based cache was prototyped, but it introduced complex concurrency overhead and stale-entry cleanup logic that paradoxically increased GC pressure due to the additional WeakReference processing.
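To make the intrusiveness of the String.intern() option concrete, here is a minimal sketch of what each parsing site would have needed. The class InternSketch and the methods parseTag and parseTagInterned are hypothetical and are not taken from the platform described above.

```java
import java.nio.charset.StandardCharsets;

class InternSketch {
    /** Hypothetical parser step: every call allocates a new String and backing array. */
    static String parseTag(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    /** The interning variant: an explicit call is required at every such site. */
    static String parseTagInterned(byte[] raw) {
        return parseTag(raw).intern();
    }

    public static void main(String[] args) {
        byte[] raw = "EUR/USD".getBytes(StandardCharsets.US_ASCII);
        // Distinct instances with equal content
        System.out.println(parseTag(raw) == parseTag(raw));                 // false
        // Interning returns the single canonical instance from the string table
        System.out.println(parseTagInterned(raw) == parseTagInterned(raw)); // true
    }
}
```

Note that interning canonicalizes whole String objects, while G1 deduplication leaves the String objects distinct and only shares their backing arrays.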

The team ultimately enabled G1 String Deduplication with the default age threshold of 3. This transparent approach required zero code changes: candidates were selected during existing evacuation pauses and deduplicated by a concurrent background thread, avoiding any new stop-the-world phases.

The result was a 22% reduction in heap usage and a drop in 95th-percentile pause times to under 50ms. The CPU overhead measured approximately 1.5% during peak market hours, an acceptable trade-off for the memory savings and latency improvement.

What candidates often miss

How does String deduplication interact with Java 9's Compact Strings, which store Latin-1 text as byte[] instead of char[]?

Answer. String Deduplication was updated to work with Compact Strings, which have been enabled by default since Java 9. From Java 9 onward the value field is always a byte[]; the coder field (LATIN1 or UTF16) records whether each character occupies one byte or two. The deduplication logic therefore hashes and compares byte[] arrays, and it takes the encoding into account, so Latin-1 strings are deduplicated against other Latin-1 strings and UTF-16 strings against their peers. Candidates often mistakenly believe the feature was deprecated with Compact Strings, but it remains fully compatible.
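A quick way to see why the two encodings cannot share an array is to compare the bytes they produce for the same text. The sketch below uses public charset APIs as an approximation of the internal layouts; the class CoderSketch and its helpers are hypothetical, and the internal UTF-16 byte order is platform-dependent, so UTF_16LE is an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CoderSketch {
    /** One byte per character, approximating the LATIN1 layout. */
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Two bytes per character, approximating the UTF16 layout (no byte-order mark). */
    static byte[] utf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] a = latin1("EUR/USD");
        byte[] b = utf16("EUR/USD");
        System.out.println(a.length);            // 7
        System.out.println(b.length);            // 14
        System.out.println(Arrays.equals(a, b)); // false
    }
}
```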

Why does the JVM impose an age threshold (default 3 GCs) before a String becomes eligible for deduplication?

Answer. The age threshold prevents the system from wasting CPU cycles deduplicating short-lived, ephemeral strings that will likely die in the next young collection. By requiring the String to survive several G1 evacuation cycles (promoting it from Eden to Survivor regions and eventually toward Tenured), the heuristic ensures that only "mature" strings—those with a high probability of long-term survival—are processed. This amortizes the cost of the hash calculation and table lookup over the expected lifetime of the object.
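The eligibility rule can be sketched as a small predicate. This is modelled on the candidate selection described in JEP 192, not copied from HotSpot; the class CandidateSketch and the parameter names fromYoung, toYoung, and age are assumptions made for illustration.

```java
class CandidateSketch {
    /** Mirrors the default of -XX:StringDeduplicationAgeThreshold. */
    static final int AGE_THRESHOLD = 3;

    /**
     * Decides whether an evacuated String should be queued for deduplication.
     *
     * @param fromYoung the object is being evacuated out of a young region
     * @param toYoung   the destination is a survivor region rather than old
     * @param age       number of collections the object has survived
     */
    static boolean isCandidate(boolean fromYoung, boolean toYoung, int age) {
        if (!fromYoung) {
            return false; // old-generation objects are not re-examined
        }
        if (toYoung) {
            // Staying young: queue exactly once, when the threshold is reached
            return age == AGE_THRESHOLD;
        }
        // Promoted early: this is the last chance to see it, so queue it now
        return age < AGE_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isCandidate(true, true, 1));  // false: still too young
        System.out.println(isCandidate(true, true, 3));  // true: reached the threshold
        System.out.println(isCandidate(true, false, 1)); // true: promoted before the threshold
        System.out.println(isCandidate(true, false, 3)); // false: already considered at age 3
    }
}
```

Checking for equality rather than "at least" keeps a long-lived String from being queued again on every subsequent collection.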

Does String deduplication affect the immutability or hashCode stability of the String instance?

Answer. No. Deduplication only changes which array the String's value field references, and the JVM performs that update internally even though the field is declared final. Since the replacement array contains identical bytes or characters, the String's logical state and hashCode remain unchanged. The hashCode is cached in a private int field within the String object itself, and because the content is identical, the cached value remains valid. The equals contract is preserved because equals compares content; whether two strings share a backing array is invisible to the API. The reference update is atomic from the application's perspective, so String's immutability guarantee holds.
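The observable contract can be checked with ordinary code. The sketch below holds whether or not the JVM has deduplicated the two strings, which is exactly the point; the class ContractSketch and its helper sameContent are hypothetical names.

```java
class ContractSketch {
    /** True when the two strings are indistinguishable through the public API. */
    static boolean sameContent(String a, String b) {
        return a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        String a = new String("FinancialInstrument".toCharArray());
        String b = new String("FinancialInstrument".toCharArray());
        System.out.println(sameContent(a, b)); // true
        // Deduplication shares backing arrays only; the String objects stay distinct
        System.out.println(a == b);            // false
    }
}
```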