On my flight today, I played around with several dataframe operations. I wouldn’t say I learned anything new. The majority of the hour went to refreshing my understanding of window functions.
Fittingly, the example I was studying consisted of flights data:
+--------+-----+--------+------+-----------+
| date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245| 6| 602| ABE| ATL|
|01020600| -8| 369| ABE| DTW|
|01021245| -2| 602| ABE| ATL|
|01020605| -4| 602| ABE| ATL|
|01031245| -4| 602| ABE| ATL|
|01030605| 0| 602| ABE| ATL|
...
You may recognize this US flights delay dataset.
I had to calculate the top 3 destinations with the largest delays for every origin. This can be solved with so-called window functions.
I wrote a Common Table Expression (CTE) that first aggregated the delay by origin and destination. Subsequently, I created a rank
column by a windowed function that ranks each row over each origin. Keeping the top 3 rows then gives the top 3 destinations with the largest delay for every origin:
WITH departureDelaysTotals AS (
SELECT origin, destination, SUM(delay) AS delayTotal
FROM departureDelays
WHERE
origin IN ('SEA', 'SFO', 'JFK') AND
destination IN ('SEA', 'SFO', 'JFK', 'DEN', 'ORD', 'LAX','ATL')
GROUP BY origin, destination
)
SELECT origin, destination, delayTotal, rank
FROM
(
SELECT
origin,
destination,
delayTotal,
dense_rank() OVER (
PARTITION BY origin
ORDER BY delayTotal DESC
) as rank
FROM departureDelaysTotals
)
WHERE rank <= 3
Not exactly novel, but it was nice to review window functions.