Answer to the question
PostgreSQL maintains per-column histograms in pg_statistic and assumes statistical independence between columns when estimating selectivity for multi-column filters. When you filter on highly correlated columns—such as city and state_code or make and model in a vehicle database—the planner multiplies individual selectivities, drastically underestimating the actual row count. This cardinality error often causes the optimizer to prefer Nested Loop joins over Hash or Merge joins, resulting in catastrophic performance on large tables.
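A minimal sketch of the multiplication error, using the city/state_code example and hypothetical selectivities (the table name and numbers here are illustrative, not from a real dataset):

```sql
-- Hypothetical 10M-row table with correlated city/state_code columns.
-- Suppose P(city = 'Springfield') = 0.001 and P(state_code = 'IL') = 0.02.
EXPLAIN
SELECT * FROM addresses
WHERE city = 'Springfield' AND state_code = 'IL';
-- Planner under the independence assumption:
--   selectivity = 0.001 * 0.02 = 0.00002  ->  ~200 of 10M rows
-- Reality when most 'Springfield' rows are in 'IL':
--   true selectivity is close to 0.001    ->  ~10,000 rows
```

The 50x gap between estimate and reality is exactly the kind of error that tips the planner into a Nested Loop it should not choose.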
To resolve this, you create an extended statistics object using CREATE STATISTICS, which builds multivariate correlation data. Specifically, the dependencies type tracks functional dependencies between columns, allowing the planner to recognize that filtering on state_code = 'CA' already constrains city to California values, avoiding the multiplication error.
-- Create extended statistics for correlated columns
CREATE STATISTICS stats_vehicle_make_model ON make, model FROM vehicles;

-- Populate the statistics
ANALYZE vehicles;
A real-world situation
A logistics platform struggled with a shipment tracking dashboard that joined a 50-million-row shipments table against customers. The query filtered by origin_state and origin_city, where 95% of rows for 'Springfield' as a city were actually in 'IL', but the planner assumed only 2% of shipments matched both predicates independently. It estimated 500 rows and chose a Nested Loop join, iterating through millions of customer records and timing out after 90 seconds.
One considered fix was disabling Nested Loop joins entirely via SET enable_nestloop = off in the session. This forced a Hash Join and completed in 3 seconds for this specific query, but carried serious risks: session-level settings can leak across pooled connections, potentially regressing other queries whose Nested Loop plans perform well, such as indexed lookups on small tables. The workaround also required application-level code changes to set the parameter before querying.
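If the team had kept this workaround, a less dangerous variant would have been to scope the setting to a single transaction with SET LOCAL, which reverts automatically and cannot leak through a connection pool. A sketch (the SELECT is a placeholder for the dashboard query):

```sql
BEGIN;
-- Applies only inside this transaction; reverts at COMMIT/ROLLBACK.
SET LOCAL enable_nestloop = off;
SELECT 1;  -- placeholder for the affected dashboard query
COMMIT;
```

This limits the blast radius but still treats the symptom, not the cardinality misestimate itself.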
Another option involved creating a composite index on (origin_city, origin_state). While this improved index selection, it did not solve the cardinality underestimation; the planner still thought few rows would emerge from the index scan and maintained the Nested Loop strategy, merely executing it faster through the covering index. Furthermore, the wide composite index consumed 4GB of additional disk space and slowed write operations on the high-velocity shipments table.
The team ultimately deployed extended statistics by running CREATE STATISTICS stats_origin_correlation ON origin_city, origin_state FROM shipments, followed by ANALYZE. This approach required no query rewrites and added negligible storage overhead. Post-deployment, the planner correctly estimated 45,000 rows and elected a Hash Join, reducing query latency to 400 milliseconds while preserving optimal plans for unrelated workloads.
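A sketch of how such a deployment can be verified, using the object and table names from the scenario (the join columns c.id and s.customer_id are assumptions about the schema):

```sql
CREATE STATISTICS stats_origin_correlation
    ON origin_city, origin_state FROM shipments;
ANALYZE shipments;

-- Compare estimated vs. actual rows and the chosen join strategy:
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.*, c.name
FROM shipments s
JOIN customers c ON c.id = s.customer_id
WHERE s.origin_city  = 'Springfield'
  AND s.origin_state = 'IL';
-- With populated extended statistics, the estimated row count should
-- move close to the actual count, and the plan should switch from
-- Nested Loop to Hash Join for this predicate combination.
```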
What candidates often miss
How does the ANALYZE command refresh extended statistics, and why might a statistics object appear unused immediately after creation?
ANALYZE computes extended statistics only when invoked explicitly on the target table, or when autovacuum's auto-analyze processes the table after the statistics object exists. Many candidates assume that CREATE STATISTICS instantly affects planning, but only the object's definition is recorded (in pg_statistic_ext) at creation time; the computed data in pg_statistic_ext_data remains empty until the next analysis cycle. Consequently, the planner continues using single-column histograms and independence assumptions until ANALYZE shipments populates the multivariate data. You can verify usage by checking the pg_stats_ext view for non-null dependencies or n_distinct values.
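A quick check against the pg_stats_ext view (available since PostgreSQL 12; the table name follows the scenario above):

```sql
-- Has the statistics object been populated by ANALYZE yet?
SELECT statistics_name, dependencies, n_distinct
FROM pg_stats_ext
WHERE tablename = 'shipments';
-- NULL dependencies / n_distinct mean ANALYZE has not run since the
-- object was created; run "ANALYZE shipments;" and re-check.
```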
What is the functional difference between dependencies and ndistinct in CREATE STATISTICS, and which query patterns benefit from each?
Dependencies capture functional relationships where one column determines another (e.g., zip_code determines city), directly correcting WHERE clause selectivity estimates. Ndistinct estimates the number of distinct combinations for column groups, which improves GROUP BY and DISTINCT estimates rather than filter selectivity. Candidates often conflate these, creating dependencies when their slow query contains a GROUP BY on correlated columns, or vice versa. When both patterns occur, specify both kinds in the object definition: CREATE STATISTICS stats_complex (dependencies, ndistinct) ON ... FROM ...; (the kinds list goes in parentheses after the statistics name; there is no WITH clause in this command).
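A sketch pairing each kind with the query shape it helps (the addresses table and its columns are hypothetical):

```sql
-- dependencies: corrects AND-ed equality filters on correlated columns.
-- ndistinct:    corrects distinct-combination counts for grouping.
CREATE STATISTICS stats_geo (dependencies, ndistinct)
    ON city, state_code FROM addresses;
ANALYZE addresses;

-- Benefits from dependencies:
SELECT * FROM addresses
WHERE city = 'Portland' AND state_code = 'OR';

-- Benefits from ndistinct:
SELECT city, state_code, count(*)
FROM addresses
GROUP BY city, state_code;
```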
Why might extended statistics fail to help with queries using OR conditions across correlated columns?
Functional dependency statistics only assist with AND-ed equality clauses, where selectivity multiplication occurs. When you filter with OR (e.g., city = 'Springfield' OR state = 'IL'), PostgreSQL calculates selectivity using the formula P(A) + P(B) - P(A ∩ B), and it cannot apply dependency coefficients to the intersection term because the statistics track joint selectivity for conjunctions, not disjunctions. Candidates frequently miss this limitation and reach for CREATE STATISTICS to fix OR-based cardinality errors; those instead require query rewriting (e.g., splitting into UNION ALL branches) or partial indexes.
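A sketch of the UNION ALL rewrite mentioned above, using the scenario's column names; the second branch excludes the first branch's rows so no shipment is counted twice:

```sql
-- Rewrite "city = 'Springfield' OR state = 'IL'" as disjoint branches
-- the planner can estimate independently.
SELECT * FROM shipments
WHERE origin_city = 'Springfield'
UNION ALL
SELECT * FROM shipments
WHERE origin_state = 'IL'
  -- IS DISTINCT FROM keeps IL rows whose city is NULL.
  AND origin_city IS DISTINCT FROM 'Springfield';
```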