Recipes for SQL
Basics
- For the basics that cover 90% of the queries you'll write, see the Mode Analytics guides for Basic and Intermediate SQL.
Tips
- Materialize small chunks of tables (e.g. `LIMIT 1000`) to test queries on. This prevents you from wasting resources and potentially locking the database (a sketch follows this list).
- Write queries in logical pieces. Think about what the joins represent, what you're selecting, and use CTEs for logical grouping.
- Use comments sparingly. The code should read like prose.
- The risk with comments is that they don't update automatically when the code changes; a stale comment can describe something the code no longer does and mislead the reader.
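A minimal sketch of the first tip, assuming a hypothetical `events` table:

```sql
-- Materialize a small sample once, then develop the rest of the query against it
with sample_events as (
    select *
    from events
    limit 1000
)
select
    user_id
    , count(*) as n_events
from sample_events
group by user_id
```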
Window Functions
- Commonly used when a query requires a "window" (group) of adjacent rows (e.g. rolling stats, running stats, previous/subsequent events, etc.)
- Similar to `GROUP BY`, but the aggregate is computed over a window of rows around each row instead of collapsing the whole group into one row
- Can provide `GROUP BY` behaviour within the window using `PARTITION BY`
- Ordering is not required if the table is guaranteed to be ordered. Ordering an ordered table may cause performance issues.
- If using the same window repeatedly, some dialects allow aliases via `WINDOW {name} AS` (see the Rolling stats example below)
- Some examples are provided below:
Rolling stats
Example with Bollinger Bands where $N=20, K=2$
```sql
with stats as (
    select
        date_time
        , ticker
        , avg(price) over sma20_window as SMA20
        , stddev(price) over sma20_window as STD20
    from stock_prices
    window sma20_window as (
        partition by ticker
        order by date_time
        rows between 19 preceding and current row
    )
)
select
    date_time
    , ticker
    , SMA20
    , SMA20 + 2*STD20 as BB20_2_UPPER
    , SMA20 - 2*STD20 as BB20_2_LOWER
from stats
```
Alternatively, without `WINDOW AS`:

```sql
with stats as (
    select
        date_time
        , ticker
        , avg(price) over (
            partition by ticker
            order by date_time
            rows between 19 preceding and current row) as SMA20
        , stddev(price) over (
            partition by ticker
            order by date_time
            rows between 19 preceding and current row) as STD20
    from stock_prices
)
select
    date_time
    , ticker
    , SMA20
    , SMA20 + 2*STD20 as BB20_2_UPPER
    , SMA20 - 2*STD20 as BB20_2_LOWER
from stats
```
Previous or subsequent events
- Use `LAG(col, n)` or `LEAD(col, n)` to move backwards or forwards by `n` rows

Previous events:

```sql
select
    date_time
    , lag(sales, 1) over (order by date_time) -- sales from the previous row
from store_sales
```

Subsequent events:

```sql
select
    date_time
    , lead(sales, 1) over (order by date_time) -- sales from the subsequent row
from store_sales
```
Ranking
- Use `RANK()` and `DENSE_RANK()` for ranking. If two values are the same, they will have the same rank.
- For `RANK()`: ties share a rank, but the following ranks are skipped (e.g. if 3 items tie at rank 2, the next rank is 5)
- For `DENSE_RANK()`: ties share a rank, and the following ranks are not skipped (e.g. if 3 items tie at rank 2, the next rank is 3)

```sql
select
    quarter
    , salesperson
    -- aliases avoid RANK/DENSE_RANK, which are reserved words in some dialects
    , rank() over quarterly as sales_rank
    , dense_rank() over quarterly as sales_dense_rank
from sales
window quarterly as (partition by quarter order by sum_sales)
```
Subqueries and CTEs
- Sometimes it's easier to write (and later read) your query as multiple stages
- Performance in modern SQL dialects should be the same for CTEs and subqueries
- In older dialects, you may run into an optimization fence (e.g. PostgreSQL before v12 would not push filters into a CTE)
Without CTE:

```sql
select def
from (
    select * from tbl
) sub
group by def
```

With CTE:

```sql
with xyz as (
    select * from tbl
)
select def
from xyz
group by def
```
Query Optimization
Minimizing Join Cost
Filter all component tables as much as possible before attempting a join. You can use CTEs for readability and organization.
```sql
with table1_filtered as (
    select a, b, c
    from table1
    where condition1
    limit 10000
)
, table2_filtered as (
    select c, d, e
    from table2
    where condition2
        and condition3
)
select
    t1.a, t1.b, t1.c, t2.d, t2.e
from table1_filtered t1
left join table2_filtered t2
    on t1.c = t2.c
```
EXPLAIN ANALYZE
- Runs the query and returns the plan, with the estimated cost and actual timing for each step
- If you think you can find a better plan than the optimizer, some engines let you force one (e.g. Query Store plan forcing in SQL Server, or the `pg_hint_plan` extension in PostgreSQL; `pg_stat_statements` only records query statistics)
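A minimal sketch, assuming PostgreSQL syntax and the `store_sales` table from the examples above:

```sql
-- Executes the query and prints the plan with estimated vs actual costs
explain analyze
select date_time, sales
from store_sales
where sales > 100;
```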
INDEX
- In many engines (e.g. SQL Server, MySQL/InnoDB), a clustered index is created on the primary key of a table by default.
- A custom clustered index can speed up queries but slows down inserts. For analytics databases, that trade-off usually favours faster queries.
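A minimal sketch, assuming standard `CREATE INDEX` syntax and the `stock_prices` table from the window-function examples:

```sql
-- Secondary index to speed up per-ticker, time-ordered scans
create index idx_stock_prices_ticker_date
    on stock_prices (ticker, date_time);
```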
MATERIALIZE (VIEW)
- View: from a set of source tables, define a virtual table that is computed when required
- Materialized View: persist the query results so reads are fast (useful for expensive analytical queries); the view must be refreshed when the source tables change
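A minimal sketch, assuming PostgreSQL syntax and the `stock_prices` table from earlier:

```sql
-- Persist an expensive daily aggregate so dashboards don't recompute it
create materialized view daily_ticker_stats as
select
    ticker
    , date_trunc('day', date_time) as day
    , avg(price) as avg_price
from stock_prices
group by 1, 2;

-- Re-run the underlying query when the source data changes
refresh materialized view daily_ticker_stats;
```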
PIVOTs
- Personally, I find pivots in SQL to be very ugly. Try pivoting in another tool if possible (e.g. Spark, Pandas, etc.)
- Reference here: https://mode.com/sql-tutorial/sql-pivot-table/
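If you do need to pivot in plain SQL, the most portable pattern is conditional aggregation. A minimal sketch, assuming the hypothetical `sales` table from the ranking example (`quarter`, `salesperson`, `sum_sales`):

```sql
-- One row per salesperson, one column per quarter
select
    salesperson
    , sum(case when quarter = 'Q1' then sum_sales else 0 end) as q1_sales
    , sum(case when quarter = 'Q2' then sum_sales else 0 end) as q2_sales
    , sum(case when quarter = 'Q3' then sum_sales else 0 end) as q3_sales
    , sum(case when quarter = 'Q4' then sum_sales else 0 end) as q4_sales
from sales
group by salesperson
```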