On my flight today, I played around with several dataframe operations. I wouldn’t say I learned anything new. The majority of the hour went to refreshing my understanding of window functions.

Fittingly, the example I was studying consisted of flights data:

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245|    6|     602|   ABE|        ATL|
|01020600|   -8|     369|   ABE|        DTW|
|01021245|   -2|     602|   ABE|        ATL|
|01020605|   -4|     602|   ABE|        ATL|
|01031245|   -4|     602|   ABE|        ATL|
|01030605|    0|     602|   ABE|        ATL|
...

You may recognize this US flights delay dataset.

I had to calculate the top 3 destinations with the largest delays for every origin. This can be solved with so-called window functions.

I wrote a Common Table Expression (CTE) that first aggregated the delay by origin and destination. Subsequently, I created a rank column by a windowed function that ranks each row over each origin. Keeping the top 3 rows then gives the top 3 destinations with the largest delay for every origin:

WITH departureDelaysTotals AS (
  SELECT origin, destination, SUM(delay) AS delayTotal
  FROM departureDelays
  WHERE
    origin IN ('SEA', 'SFO', 'JFK') AND 
    destination IN ('SEA', 'SFO', 'JFK', 'DEN', 'ORD', 'LAX','ATL')
  GROUP BY origin, destination
)

SELECT origin, destination, delayTotal, rank
FROM 
  (
    SELECT
      origin,
      destination,
      delayTotal,
      dense_rank() OVER (
        PARTITION BY origin
        ORDER BY delayTotal DESC
      ) as rank
    FROM departureDelaysTotals
  )
WHERE rank <= 3

Not exactly novel, but it was nice to review window functions.