In yesterday’s post I wondered why I can’t find the reduce
function. Well, as it turns out, this function was renamed to aggregate
in Spark 3.3.5. Mystery solved!
Now, using the aggregate
function, I managed to calculate the average temperature of my array in Fahrenheit:
spark.sql("""
SELECT
celsius,
aggregate(
celsius,
0,
(acc, t) -> acc + t,
acc -> round(acc / size(celsius) * 9 / 5 + 32, 1)
) as fahrenheit_average
FROM temperature_view;
""").show()
Output:
+--------------------+------------------+
| celsius|fahrenheit_average|
+--------------------+------------------+
|[35, 52, 32, 89, 98]| 142.2|
|[32, 52, 2, 9, 98...| 96.8|
+--------------------+------------------+
Function signature
Note that in Spark, the aggregate
function has a fourth argument instead of the usual three I’d expect from a “reduce” function:
- Input array to reduce:
celsius
- Start value of the accumulator:
0
- Merge function:
(acc, t) -> acc + t
- Finishing function:
acc -> round(acc / size(celsius) * 9 / 5 + 32, 1)
The finishing function is an extra feature of the Spark implementation, that allows more intricate operations on top of the merge function.