Go Programming: Senior Go Backend Engineer

Outline the mechanism by which Go's runtime distributes timer processing across per-P heaps to eliminate the global timer lock bottleneck, and specify the lock-free algorithm that handles concurrent timer modifications.


Answer to the question

History: Prior to Go 1.14, timers lived in global runtime state guarded by coarse locks: a single heap behind one central lock before Go 1.10, then 64 timersBucket shards (each a lock-protected 4-ary heap, assigned by P id) from Go 1.10 through 1.13. Goroutines creating or modifying timers contended for these locks, creating a severe scalability bottleneck in high-throughput network servers managing thousands of concurrent connections with timeouts.

The Problem: As core counts increased, the timer locks became serializing points. When a goroutine called time.AfterFunc or modified an existing timer, it had to acquire the bucket's lock, update its 4-ary heap, and potentially wake the bucket's dedicated timerproc goroutine. This serialized access prevented timer operations from scaling horizontally with CPU cores, degrading tail latency under load.

The Solution: Go 1.14 redesigned the timer system to use per-P (processor) timer heaps. Each logical processor maintains its own quaternary (4-ary) heap of timers. When a timer is created or reset, the runtime drives a state machine of atomic compare-and-swap operations on the timer's status word (timers are represented by runtime.timer structs); heap membership itself is guarded only by a short-lived per-P lock. If a timer is modified from a different P than its owner, the runtime marks it atomically and moves it between heaps without blocking the originating goroutine. The old timerproc goroutines are gone: timer processing is integrated into the scheduler itself, so each P checks and runs its own expired timers (checkTimers, called from findrunnable) without global synchronization.

// Conceptual representation of timer modification, simplified from
// the runtime's actual state machine in runtime/time.go.
func resetTimer(t *timer, when int64) {
	// Lock-free state transition using atomics.
	for {
		old := atomic.LoadUint32(&t.status)
		if old == timerWaiting || old == timerRunning {
			// Attempt to claim the timer for modification atomically.
			if atomic.CompareAndSwapUint32(&t.status, old, timerModifying) {
				t.when = when
				// Rebalancing within the local P's heap happens here.
				atomic.StoreUint32(&t.status, timerWaiting)
				break
			}
		}
	}
}

A real-world scenario

Problem Description: A high-frequency trading gateway written in Go experienced latency spikes exceeding 10ms during market open, despite having low CPU utilization. Profiling revealed that 40% of all mutex contention originated from runtime.timer operations, specifically from connection read deadlines being extended via SetReadDeadline. The operations team initially suspected network latency, but Go's execution tracer pinpointed the global timer lock as the culprit.

Different Solutions Considered:

One approach was to implement a user-space timing wheel outside the standard library. This would shard timers into buckets based on expiration time, using a fixed-size circular buffer. While this eliminated runtime lock contention, it introduced significant complexity: the trading team would need to maintain a separate goroutine for wheel advancement, handle overflow buckets for long timeouts, and ensure memory safety without the runtime's guarantees. Additionally, the wheel's granularity was insufficient for sub-millisecond trading requirements, and the custom implementation risked becoming a long-term maintenance burden.

Another considered solution was to pool and reuse time.Timer objects aggressively to minimize allocations. This reduced GC pressure but did not address the fundamental contention on the global timer lock when calling Reset() or Stop(). The team also explored using time.Ticker with batched deadline checks, but this violated the exchange's requirement for immediate connection termination upon timeout, making it non-compliant with regulatory specifications.

Chosen Solution and Result: The team migrated to Go 1.15 (incorporating the per-P timer improvements) and replaced direct SetReadDeadline calls with a custom connection wrapper that managed deadline extensions through time.AfterFunc callbacks rather than resetting absolute deadlines. This change distributed timer entries across all available Ps, reducing the mutex contention to negligible levels. The result was a 95% reduction in p99 latency (from 12ms to 0.6ms) during peak trading volume, allowing the gateway to handle 100,000 concurrent connections without scheduler degradation.

What candidates often miss

How does the runtime handle timer migration when a goroutine moves between Ps, and why can't timers simply follow the goroutine?

Timers are bound to the P where they were created or last reset, not to the goroutine. When a goroutine migrates between Ps during work stealing, the timer remains on the original P's heap to avoid atomic overhead during every context switch. When the timer fires, its callback runs on the owning P; if that callback readies a goroutine that has since migrated (for example, by sending on a time.Timer's channel), the goroutine is simply enqueued on a run queue as usual. This separation is crucial because timer heaps require heap-invariant maintenance; allowing timers to migrate with goroutines would require locking both the source and destination P's timer heaps during every steal, reintroducing the contention the per-P design eliminated.

What specific race condition necessitates the atomic timer state machine (core states timerNoStatus, timerWaiting, timerRunning, timerModifying) in the timer implementation?

The state machine prevents the "lost wakeup" race where a timer is reset to a later time after it has been selected for execution but before its callback runs. Without atomic states, P A could select a timer from its heap (marking it timerRunning), while P B simultaneously resets it. The states ensure that a Reset operation sees timerModifying or timerRunning and spins until the timer is safe to modify. Candidates often miss that timerModifying acts as a transient spin-lock during state changes, preventing the callback from executing with stale data or being dropped entirely. (The full Go 1.14 machine has additional statuses, such as timerModifiedEarlier, timerModifiedLater, and timerMoving, that track cross-P moves and deferred heap reordering.)

Why does the runtime use a quaternary (4-ary) heap for timers rather than a standard binary heap, and how does this relate to cache-line optimization?

The 4-ary heap reduces the depth of the tree to roughly log₄(n) levels, half a binary heap's log₂(n), minimizing pointer chasing and cache misses during sift-up and sift-down operations. The four children of a node occupy contiguous slots in the backing array, so they typically fall within a single 64-byte cache line on modern x86_64 architectures, making the wider node nearly free in memory traffic. This structure is a deliberate compromise: while it increases the number of comparisons per level (up to three among four children versus one between two), it significantly reduces tree depth and cache misses, which dominate the latency of timer heap operations when managing thousands of timers per P.
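To make the layout concrete, here is a sketch of a quaternary min-heap keyed by when values, with the children of node i at indices 4i+1 through 4i+4, contiguous in the array. It mirrors the shape of the runtime's timer heap, not its actual code.

```go
package main

// fourAryHeap is a minimal quaternary min-heap over int64 "when" values.
type fourAryHeap struct {
	when []int64
}

func (h *fourAryHeap) push(w int64) {
	h.when = append(h.when, w)
	h.siftUp(len(h.when) - 1)
}

func (h *fourAryHeap) pop() int64 {
	top := h.when[0]
	last := len(h.when) - 1
	h.when[0] = h.when[last]
	h.when = h.when[:last]
	if last > 0 {
		h.siftDown(0)
	}
	return top
}

func (h *fourAryHeap) siftUp(i int) {
	for i > 0 {
		parent := (i - 1) / 4
		if h.when[parent] <= h.when[i] {
			return
		}
		h.when[parent], h.when[i] = h.when[i], h.when[parent]
		i = parent
	}
}

func (h *fourAryHeap) siftDown(i int) {
	n := len(h.when)
	for {
		smallest := i
		// The four children are contiguous in the slice, so this scan
		// touches at most one or two cache lines.
		for c := 4*i + 1; c <= 4*i+4 && c < n; c++ {
			if h.when[c] < h.when[smallest] {
				smallest = c
			}
		}
		if smallest == i {
			return
		}
		h.when[i], h.when[smallest] = h.when[smallest], h.when[i]
		i = smallest
	}
}

func main() {
	h := &fourAryHeap{}
	for _, w := range []int64{9, 3, 7, 1} {
		h.push(w)
	}
	println(h.pop()) // smallest "when" first: 1
}
```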