Recipes for SQL
Basics
- For the basics that cover 90% of the queries you'll write, see the Mode Analytics guides for Basic and Intermediate SQL.
Tips
- Materialize small chunks of tables (e.g. `LIMIT 1000`) to test queries on. This prevents you from wasting resources and potentially locking the database (a sketch follows this list).
- Write queries in logical pieces. Think about what the joins represent, what you're selecting, and use CTEs for logical grouping.
- Use comments sparingly. The code should read like prose.
- The risk with comments is that they don't update automatically when the code changes; a stale comment can describe something the code no longer does and mislead the reader.
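A minimal sketch of the first tip, assuming a hypothetical `events` table:

```sql
-- Materialize a small sample once, then develop the rest of the query against it
with sample_events as (
    select *
    from events
    limit 1000
)
select
    user_id
    , count(*) as n_events
from sample_events
group by user_id
```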
Window Functions
- Commonly used when a query requires a "window" (group) of adjacent rows (e.g. rolling stats, running stats, previous/subsequent events, etc.)
- Similar to `GROUP BY`, but the aggregate is computed over a window of rows around each row instead of collapsing the whole group into one row
- Can provide `GROUP BY` behaviour within the window using `PARTITION BY`
- Ordering is not required if the table is guaranteed to be ordered. Ordering an ordered table may cause performance issues.
- If using the same window repeatedly, some dialects allow aliases via `WINDOW {name} AS` (see the Rolling stats example below)
- Some examples are provided below:
Rolling stats
Example with Bollinger Bands where $N=20, K=2$
```sql
with stats as (
    select
        date_time
        , ticker
        , avg(price) over sma20_window as SMA20
        , stddev(price) over sma20_window as STD20
    from stock_prices
    window sma20_window as (
        partition by ticker
        order by date_time
        rows between 19 preceding and current row
    )
)
select
    date_time
    , ticker
    , SMA20
    , SMA20 + 2*STD20 as BB20_2_UPPER
    , SMA20 - 2*STD20 as BB20_2_LOWER
from stats
```
Alternatively, without `WINDOW AS`:

```sql
with stats as (
    select
        date_time
        , ticker
        , avg(price) over (
            partition by ticker
            order by date_time
            rows between 19 preceding and current row) as SMA20
        , stddev(price) over (
            partition by ticker
            order by date_time
            rows between 19 preceding and current row) as STD20
    from stock_prices
)
select
    date_time
    , ticker
    , SMA20
    , SMA20 + 2*STD20 as BB20_2_UPPER
    , SMA20 - 2*STD20 as BB20_2_LOWER
from stats
```
Previous or subsequent events
- Use `LAG(col, n)` or `LEAD(col, n)` to move backwards or forwards by `n` rows

Previous events:

```sql
select
    date_time
    , lag(sales, 1) over (order by date_time) -- sales from the previous row
from store_sales
```

Subsequent events:

```sql
select
    date_time
    , lead(sales, 1) over (order by date_time) -- sales from the subsequent row
from store_sales
```
Ranking
- Use `RANK()` and `DENSE_RANK()` for ranking. If two values are the same, they will have the same rank.
- For `RANK()`: ties share a rank, but the following ranks are skipped (e.g. if 3 items tie at rank 2, the next rank is 5)
- For `DENSE_RANK()`: ties share a rank, and the following ranks are not skipped (e.g. if 3 items tie at rank 2, the next rank is 3)

```sql
select
    quarter
    , salesperson
    -- aliases avoid RANK/DENSE_RANK, which are reserved words in some dialects
    , rank() over quarterly as sales_rank
    , dense_rank() over quarterly as sales_dense_rank
from sales
window quarterly as (partition by quarter order by sum_sales)
```
Subqueries and CTEs
- Sometimes it's easier to write (and later read) your query as multiple stages
- Performance in modern SQL dialects should be the same for CTEs and subqueries
- In older dialects, you may run into an optimization fence (e.g. PostgreSQL before v12 would not push filters into a CTE)
Without CTE:

```sql
select def
from (
    select * from tbl
) sub
group by def
```

With CTE:

```sql
with xyz as (
    select * from tbl
)
select def
from xyz
group by def
```
Query Optimization
Minimizing Join Cost
Filter all component tables as much as possible before attempting a join. You can use CTEs for readability and organization.
```sql
with table1_filtered as (
    select a, b, c
    from table1
    where condition1
    limit 10000
)
, table2_filtered as (
    select c, d, e
    from table2
    where condition2
        and condition3
)
select
    t1.a, t1.b, t1.c, t2.d, t2.e
from table1_filtered t1
left join table2_filtered t2
    on t1.c = t2.c
```
EXPLAIN ANALYZE
- Runs the query and returns the plan, with the estimated cost and actual timing for each step
- If you think you can find a better plan than the optimizer, some engines let you force one (e.g. Query Store plan forcing in SQL Server, or the `pg_hint_plan` extension in PostgreSQL; `pg_stat_statements` only records query statistics)
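A minimal sketch, assuming PostgreSQL syntax and the `store_sales` table from the examples above:

```sql
-- Executes the query and prints the plan with estimated vs actual costs
explain analyze
select date_time, sales
from store_sales
where sales > 100;
```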
INDEX
- In many engines (e.g. SQL Server, MySQL/InnoDB), a clustered index is created on the primary key of a table by default.
- A custom clustered index can speed up queries but slows down inserts. For analytics databases, that trade-off usually favours faster queries.
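A minimal sketch, assuming standard `CREATE INDEX` syntax and the `stock_prices` table from the window-function examples:

```sql
-- Secondary index to speed up per-ticker, time-ordered scans
create index idx_stock_prices_ticker_date
    on stock_prices (ticker, date_time);
```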
MATERIALIZE (VIEW)
- View: from a set of source tables, define a virtual table that is computed when required
- Materialized View: persist the query results so reads are fast (useful for expensive analytical queries); the view must be refreshed when the source tables change
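A minimal sketch, assuming PostgreSQL syntax and the `stock_prices` table from earlier:

```sql
-- Persist an expensive daily aggregate so dashboards don't recompute it
create materialized view daily_ticker_stats as
select
    ticker
    , date_trunc('day', date_time) as day
    , avg(price) as avg_price
from stock_prices
group by 1, 2;

-- Re-run the underlying query when the source data changes
refresh materialized view daily_ticker_stats;
```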
PIVOTs
- Personally, I find pivots in SQL to be very ugly. Try pivoting in another tool if possible (e.g. Spark, Pandas, etc.)
- Reference here: https://mode.com/sql-tutorial/sql-pivot-table/
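If you do need to pivot in plain SQL, the most portable pattern is conditional aggregation. A minimal sketch, assuming the hypothetical `sales` table from the ranking example (`quarter`, `salesperson`, `sum_sales`):

```sql
-- One row per salesperson, one column per quarter
select
    salesperson
    , sum(case when quarter = 'Q1' then sum_sales else 0 end) as q1_sales
    , sum(case when quarter = 'Q2' then sum_sales else 0 end) as q2_sales
    , sum(case when quarter = 'Q3' then sum_sales else 0 end) as q3_sales
    , sum(case when quarter = 'Q4' then sum_sales else 0 end) as q4_sales
from sales
group by salesperson
```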