GROUP BY is used to group rows and aggregate data, but if misused, it can lead to serious errors or suboptimal performance.
Key nuances:
Example:
SELECT customer_id, COUNT(*) as orders FROM orders WHERE order_date >= '2024-01-01' GROUP BY customer_id HAVING COUNT(*) > 10;
Can you refer to fields in SELECT after GROUP BY that were not specified in either GROUP BY or an aggregate function?
Answer: No, this will lead to an error in most SQL implementations (e.g., in MS SQL, PostgreSQL). Some specific databases may return a random, incorrect value (especially in MySQL with sql_mode 'ONLY_FULL_GROUP_BY' turned off), but this behavior is incorrect and not guaranteed by the standard. A correct example:
SELECT department, AVG(salary) FROM employees GROUP BY department;
Story
In an e-commerce project, a report "revenue by products" was prepared using the query SELECT sku, price, SUM(qty) FROM orders GROUP BY sku. They did not consider: price wasn’t included in GROUP BY and was outside an aggregate function, the result — MySQL returned the first value of price, which during a promotion led to serious errors in the report. Correction — either add price to GROUP BY or use an aggregate function.
Story
In a BI project, a complex report with several JOINs and GROUP BY took 80 minutes instead of the planned 3. After analysis, it was found that there were no indexes on the GROUP BY and filtering fields, creating huge temporary tables for aggregation. Solution — index optimization and rewriting the query with table expressions.
Story
A developer used HAVING to filter values by a non-aggregated user attribute. As a result, the server processed grouping over all data, then removed them by HAVING, reducing performance. Fixed — moved this check to WHERE to narrow the selection before aggregation.