Recipes for SQL

Basics

Tips

  • Materialize small chunks of tables (e.g. LIMIT 1000) to test queries on. This prevents you from wasting resources and potentially locking the database.
  • Write queries in logical pieces. Think about what the joins represent, what you're selecting, and use CTEs for logical grouping.
  • Use comments sparingly. The code should read like prose.
    • The risk of using comments is that they don't automatically update when you update the code. If that happens, the comment may have nothing to do with the code and you may get confused.

Window Functions

  • Commonly used when a query requires a "window" (group) of adjacent rows (e.g. rolling stats, running stats, previous/subsequent events, etc.)
  • Similar to GROUP BY, but aggregation happens only for adjacent rows instead of the entire window.
  • Can provide GROUP BY behaviour within the window using PARTITION BY
  • Ordering is not required if the table is guaranteed to be ordered. Ordering an ordered table may cause performance issues.
  • If using the same window repeatedly, some dialect allow for alises via WINDOW {name} AS (example in rolling_window_statistics)
  • Some examples provided below:

Rolling stats

  • Example with Bollinger Bands where $N=20, K=2$

    with stats as (
          select
              date_time
              , ticker
              , avg(price)    over sma20_window as SMA20     
              , stddev(price) over sma20_window as STD20  
          from stock_prices
          window sma20_window as (
              partition by ticker
              order by date_time
              rows between 19 preceding and current row
          )
      )
    
      select
          date_time
          , ticker
          , SMA20
          , SMA20 + 2*STD20 as BB20_2_UPPER
          , SMA20 - 2*STD20 as BB20_2_LOWER
      from stats
    
  • Alternatively, without WINDOW AS:

    with stats as (
          select
              date_time
              , ticker
              , avg(price) over (      
                  partition by ticker
                  order by date_time
                  rows between 19 preceding and current row) as SMA20
              , stddev(price) over (
                  partition by ticker
                  order by date_time
                  rows between 19 preceding and current row) as STD20_2
          from stock_prices
      )
    
      select
          date_time
          , ticker
          , SMA20
          , SMA20 + 2*STD20 as BB20_2_UPPER
          , SMA20 - 2*STD20 as BB20_2_LOWER
      from stats
    

Running stats

  • Specifying rows between unbounded gets all preceding rows
    select
          date_time
          , store
          , sum(sales) over stores_alltime as cumulative_sales   
      from store_sales
      window stores_alltime as (
          partition by store
          order by date_time
          rows between unbounded preceding and current row
      )
    

Previous or subsequent events

  • Use LAG(col, n) or LEAD(col, n) to move backwards or forwards by n rows
  • Previous events:

    select
          date_time
          , lag(sales, 1) -- sales from the previous row
      from store_sales
    
  • Subsequent events:

    select
          date_time
          , lead(sales, 1) -- sales from the subsequent row
      from store_sales
    

Ranking

  • Use RANK() and DENSE_RANK() for ranking. If two values are the same, they will have the same rank.
    • For RANK(): Ties are the same, but next rankings are skipped (e.g. if 3 items tied at rank 2, then next rank is 5)
    • For DENSE_RANK(): Ties are the same, and next rankings are not skipped (e.g. if 3 items tied at rank 2, then next rank is 3)
      select
          quarter
          , salesperson
          , rank()       over quarterly as rank
          , dense_rank() over quarterly as dense_rank
      from sales
      window quarterly as (partition by quarter order by sum_sales)
      

Percentiles

  • Use NTILE(k) to break data into k partitions. The return value is the partition that row belongs to.
    select
          quarter
          , salesperson
          , ntile(100) over (partition by quarter order by sum_sales) as percentile
      from sales
    

Subqueries and CTEs

  • Sometimes it's easier to write (and later read) your query as multiple stages
  • Performance in modern SQL dialects should be the same for CTEs vs subqueries

  • Without CTE:

    select *
      from (
          select *
          from tbl
      )
      group by def
    
  • With CTE:

    with xyz as (
          select *
          from tbl
      )
    
      select *
      from xyz
      group by def
    

Query Optimization

Minimizing Join Cost

  • Filter all component tables as much as possible before attempting a join. You can use CTEs for readability and organization.

    with table1_filtered as (
          select a,b,c
          from table1
          where condition1
          limit 10000
      )
    
      , table2_filtered as (
          select c,d,e
          from table2
          where condition2 and condition3
      )
    
      select t1.a, t1.b, t1.c, t2.d, t2.e
      from table1_filtered t1
      left join table2_filtered t2 on t1.c == t2.c
    

EXPLAIN ANALYZE

  • Returns the query plan and estimated cost for each step
  • If you think you can find a better plan than the optimizer, you can force a specific plan using the Query Store (or pg_stat_statements)

INDEX

  • By default, a clustered index is created on the primary key of a table.
  • A custom clustered index can speed up queries, but slow down inserts. For analytics DBs in general, faster queries are better.

MATERIALIZE (VIEW)

  • View: From a set of source tables, create an imaginary table that can be computed when requried
  • Materialized View: Store the intermediate steps in memory to make viewing the table faster (useful for analytical tables)

PIVOTs