Go Backend Engineer

What invariant ensures that Go's thread-local allocator can service small object requests without acquiring a global lock?

Answer to the question

History

Go's memory allocator descends from TCMalloc, Google's thread-caching malloc designed for C++ multi-threaded servers. The runtime implements a multi-level cache specifically to eliminate lock contention in concurrent programs. This design prioritizes throughput over memory efficiency in the small-object fast path.

The Problem

In highly concurrent services, requiring every allocation to acquire a global heap lock would serialize goroutines and destroy throughput. The challenge lies in providing O(1) allocation latency without synchronization for the common case while maintaining safety. Traditional malloc implementations suffer from cache line bouncing when multiple CPUs compete for the same lock word.

The Solution

The runtime maintains a per-P cache (mcache) containing spans for each of 67 size classes. When a goroutine allocates a small object (≤32KB), it either increments a boundary pointer or pops from a thread-local free list within its mcache, requiring no atomic operations. The critical invariant is that an mcache is owned exclusively by one P at any moment, and allocations never cross P boundaries, thus avoiding shared mutable state.

```go
type PriceTick struct {
	Symbol uint32
	Price  float64
}

func ProcessTick() {
	// A heap-escaping &PriceTick{} takes 16 bytes from the current P's
	// mcache without locking. (In this trivial form, escape analysis
	// would actually keep the value on the stack.)
	tick := &PriceTick{}
	_ = tick
}
```

A real-world scenario

A high-frequency trading platform processed 500,000 market data events per second, with each event requiring temporary 24-byte structs for price normalization. The initial implementation utilized a global sync.Pool for these objects, which became a severe contention point under load, consuming 35% of CPU time in atomic operations and cache coherence traffic.

Solution A: Manual Pool Sharding

The team considered manually sharding the pool into 256 internal sub-pools selected by goroutine ID hash. Pros: Distributes contention across cache lines. Cons: Uneven utilization creates memory bloat in idle shards, and complex starvation handling is required when a local shard empties while others contain free objects.
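The sharding idea above can be sketched as follows. Go does not expose goroutine IDs, so an atomic counter stands in for the "goroutine ID hash"; `ShardedPool` and its methods are hypothetical names, not a library API:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const numShards = 256

// shard pads each sync.Pool out to a separate cache line so that
// neighboring shards do not false-share (64 bytes is assumed here).
type shard struct {
	pool sync.Pool
	_    [64]byte
}

// ShardedPool spreads Get/Put traffic round-robin across sub-pools.
type ShardedPool struct {
	next   atomic.Uint64 // stand-in for a goroutine-ID hash
	shards [numShards]shard
}

func NewShardedPool(newFn func() any) *ShardedPool {
	p := &ShardedPool{}
	for i := range p.shards {
		p.shards[i].pool.New = newFn
	}
	return p
}

func (p *ShardedPool) Get() any {
	return p.shards[p.next.Add(1)%numShards].pool.Get()
}

func (p *ShardedPool) Put(v any) {
	p.shards[p.next.Add(1)%numShards].pool.Put(v)
}

func main() {
	pool := NewShardedPool(func() any { return new([24]byte) })
	buf := pool.Get().(*[24]byte)
	pool.Put(buf)
	fmt.Println(len(buf)) // 24
}
```

Note how the sketch already exhibits the stated cons: a Put may land in a different shard than the matching Get, so objects drift and idle shards hoard memory.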

Solution B: Per-Worker Arenas

They evaluated pre-allocating large memory arenas per worker goroutine with bump-pointer allocation. Pros: Zero contention and extremely fast allocation path. Cons: Requires manual memory management, risks leaking memory if reset pointers are mishandled, and complicates object lifetime tracking across asynchronous boundaries.
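A minimal bump-pointer arena looks like the sketch below (`Arena`, `Alloc`, and `Reset` are hypothetical names for illustration). The `Reset` method makes the lifetime hazard concrete: one mishandled reset invalidates every slice handed out earlier:

```go
package main

import "fmt"

// Arena is a minimal bump-pointer allocator: a fixed byte slab plus an
// offset that only moves forward until Reset.
type Arena struct {
	buf []byte
	off int
}

func NewArena(size int) *Arena { return &Arena{buf: make([]byte, size)} }

// Alloc returns n bytes aligned to 8, or nil if the arena is exhausted.
func (a *Arena) Alloc(n int) []byte {
	aligned := (a.off + 7) &^ 7
	if aligned+n > len(a.buf) {
		return nil
	}
	p := a.buf[aligned : aligned+n : aligned+n]
	a.off = aligned + n
	return p
}

// Reset reclaims everything at once; slices returned by earlier Alloc
// calls must no longer be used -- this is the manual-management risk.
func (a *Arena) Reset() { a.off = 0 }

func main() {
	a := NewArena(1 << 10)
	b := a.Alloc(24)
	fmt.Println(len(b), a.off) // 24 24
}
```

The fast path is a single comparison and an add, mirroring the zero-contention pro, but nothing in the type system stops a stale slice from outliving a Reset.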

Solution C: Stack Allocation and Batching

The chosen approach restructured the event processor to use value structs instead of pointers, keeping data on the stack where possible, and processing events in batches of 1000 to amortize allocations. Pros: Eliminates heap pressure entirely for short-lived data and requires no synchronization primitives. Cons: Demanded significant refactoring of interfaces that previously expected pointer semantics and increased stack usage per goroutine.
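The value-struct-plus-batching shape can be sketched like this, assuming a 24-byte tick layout similar to the scenario's (`priceTick` and `normalizeBatch` are illustrative names, and the normalization step is a placeholder):

```go
package main

import "fmt"

// priceTick mirrors the 24-byte event struct from the scenario.
type priceTick struct {
	Symbol uint32
	Price  float64
	Volume uint64
}

// normalizeBatch processes ticks by value within a reusable slice, so
// no per-event heap object is created.
func normalizeBatch(batch []priceTick) float64 {
	var sum float64
	for i := range batch {
		batch[i].Price *= 1.0001 // placeholder normalization step
		sum += batch[i].Price
	}
	return sum
}

func main() {
	var batch [1000]priceTick // value array, no per-event pointers
	for i := range batch {
		batch[i] = priceTick{Symbol: uint32(i), Price: 100}
	}
	fmt.Printf("%.2f\n", normalizeBatch(batch[:]))
}
```

Because `batch` is an array of values reused across calls, the stated con also shows up here: each goroutine's stack (or long-lived buffer) must hold the full 24 KB batch.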

Result

By implementing Solution C, the service eliminated 99% of heap allocations in the hot path. P99 latency dropped from 12 milliseconds to 180 microseconds, and garbage collection cycles decreased by 85%, allowing the service to meet its sub-millisecond SLA requirements.
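Claims like "eliminated 99% of heap allocations" can be checked with `testing.AllocsPerRun` from the standard library. A hedged sketch, comparing the per-event-pointer pattern against the batched value pattern (the `sink` global forces the pointer version to escape to the heap):

```go
package main

import (
	"fmt"
	"testing"
)

type priceTick struct {
	Symbol uint32
	Price  float64
	Volume uint64
}

var sink *priceTick

func main() {
	// Per-event pointer allocation: one heap object per call.
	perEvent := testing.AllocsPerRun(100, func() {
		sink = &priceTick{Symbol: 1, Price: 100}
	})

	// Batched value processing: the array is reused, so the measured
	// loop performs zero allocations.
	var batch [1000]priceTick
	batched := testing.AllocsPerRun(100, func() {
		for i := range batch {
			batch[i].Price *= 1.0001
		}
	})

	fmt.Println(perEvent, batched) // 1 0
}
```

In a real codebase this comparison belongs in a benchmark, but the same two-line measurement is a quick way to confirm that a hot path stopped allocating.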

What candidates often miss

How does Go bound memory fragmentation when allocating objects of varying sizes from fixed-size spans?

Go employs 67 distinct size classes with specific granularity (8-byte steps up to 512 bytes, then larger intervals). Objects are rounded up to the nearest class size, which bounds internal fragmentation to approximately 12.5%. External fragmentation is minimized because each mspan contains objects of exactly one size class, preventing small objects from pinning large blocks of memory.
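To make the rounding concrete, here is a toy `roundUp` over an illustrative subset of the runtime's size-class table (the full table has 67 classes; `roundUp` is a hypothetical helper, not a runtime API):

```go
package main

import "fmt"

// A handful of real small size classes, for illustration only.
var classes = []int{8, 16, 24, 32, 48, 64, 80, 96, 112, 128, 512, 576}

// roundUp returns the smallest listed class that fits n, mimicking how
// the allocator rounds a request up to its size class.
func roundUp(n int) int {
	for _, c := range classes {
		if n <= c {
			return c
		}
	}
	return n // request falls outside this subset
}

func main() {
	for _, n := range []int{9, 24, 520} {
		fmt.Printf("request %d -> size class %d\n", n, roundUp(n))
	}
}
```

A 520-byte request lands in the 576-byte class, wasting about 10% of the block; the spacing of the larger classes is what keeps worst-case internal waste near the 12.5% figure above.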

Why does the runtime clear heap bitmaps rather than user-visible memory during allocation?

The allocator stores type information and pointer bitmaps in heapArena metadata structures rather than in per-object headers. At allocation time the runtime records the new object's pointer layout in those bitmaps; the object's data memory is zeroed only when the span's needzero flag indicates it may contain stale bytes, since memory freshly obtained from the operating system is already zero. Skipping redundant zeroing this way defers work, improves cache locality, and reduces the memory bandwidth consumed on the allocation path.

What forces a span to transition from mcache back to mcentral during garbage collection?

During the GC sweep phase, the runtime examines spans held in mcache instances. If a span contains no allocated objects (all slots freed), the P returns it to mcentral rather than retaining it. This prevents memory hoarding and ensures balanced distribution of free memory across processors, though it incurs the cost of reacquiring the central lock.