ProgrammingData Analyst / Backend Developer

What pitfalls are there when using the GROUP BY clause in SQL, especially regarding aggregation and optimization of complex queries?

Pass interviews with Hintsage AI assistant

Answer.

GROUP BY is used to group rows and aggregate data, but if misused, it can lead to serious errors or suboptimal performance.

Key nuances:

  • Only columns from GROUP BY or aggregate functions are allowed in SELECT.
  • In complex queries with multiple JOINs, duplicates and incorrect aggregation may occur.
  • The formal order: GROUP BY is executed after WHERE and before HAVING.
  • In the absence of indexing on group columns, the query may run very slowly on large datasets.
  • HAVING filters after grouping, while WHERE filters before it.

Example:

SELECT customer_id, COUNT(*) as orders FROM orders WHERE order_date >= '2024-01-01' GROUP BY customer_id HAVING COUNT(*) > 10;

Trick question.

Can you refer to fields in SELECT after GROUP BY that were not specified in either GROUP BY or an aggregate function?

Answer: No, this will lead to an error in most SQL implementations (e.g., in MS SQL, PostgreSQL). Some specific databases may return a random, incorrect value (especially in MySQL with sql_mode 'ONLY_FULL_GROUP_BY' turned off), but this behavior is incorrect and not guaranteed by the standard. A correct example:

SELECT department, AVG(salary) FROM employees GROUP BY department;

Examples of real errors due to ignorance of nuances in the topic.


Story

In an e-commerce project, a report "revenue by products" was prepared using the query SELECT sku, price, SUM(qty) FROM orders GROUP BY sku. They did not consider: price wasn’t included in GROUP BY and was outside an aggregate function, the result — MySQL returned the first value of price, which during a promotion led to serious errors in the report. Correction — either add price to GROUP BY or use an aggregate function.


Story

In a BI project, a complex report with several JOINs and GROUP BY took 80 minutes instead of the planned 3. After analysis, it was found that there were no indexes on the GROUP BY and filtering fields, creating huge temporary tables for aggregation. Solution — index optimization and rewriting the query with table expressions.


Story

A developer used HAVING to filter values by a non-aggregated user attribute. As a result, the server processed grouping over all data, then removed them by HAVING, reducing performance. Fixed — moved this check to WHERE to narrow the selection before aggregation.