In yesterday’s post I wondered why I couldn’t find the reduce function. Well, as it turns out, this function is called aggregate in Spark SQL, where it has been available since Spark 2.4. Mystery solved!

Now, using the aggregate function, I managed to calculate the average temperature of my array in Fahrenheit:

spark.sql("""
  SELECT
    celsius,
    aggregate(
      celsius,
      0,
      (acc, t) -> acc + t,
      acc -> round(acc / size(celsius) * 9 / 5 + 32, 1)
    ) as fahrenheit_average
  FROM temperature_view;
""").show()

Output:

+--------------------+------------------+
|             celsius|fahrenheit_average|
+--------------------+------------------+
|[35, 52, 32, 89, 98]|             142.2|
|[32, 52, 2, 9, 98...|              96.8|
+--------------------+------------------+

Function signature

Note that in Spark, the aggregate function takes four arguments instead of the usual three I’d expect from a “reduce” function:

  1. Input array to reduce: celsius
  2. Start value of the accumulator: 0
  3. Merge function: (acc, t) -> acc + t
  4. Finishing function: acc -> round(acc / size(celsius) * 9 / 5 + 32, 1)

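The finishing function is an optional extra of the Spark implementation. It is applied once to the final accumulator value, after the merge function has processed every element. Here that lets me turn the sum into a mean and convert it to Fahrenheit in the same call.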
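
To check my understanding of the four arguments, here is a rough plain-Python sketch of the same calculation. It is not how Spark implements anything: the aggregate helper below is my own hypothetical stand-in built on functools.reduce, and the input is just the first row from the output above.

```python
from functools import reduce


def aggregate(values, start, merge, finish=lambda acc: acc):
    """Hypothetical stand-in that mimics the shape of Spark's aggregate.

    values: input list to reduce
    start:  initial accumulator value
    merge:  called as merge(acc, element) for every element
    finish: applied once to the final accumulator (identity by default)
    """
    return finish(reduce(merge, values, start))


celsius = [35, 52, 32, 89, 98]  # first row of the output above

fahrenheit_average = aggregate(
    celsius,
    0,
    lambda acc, t: acc + t,                                 # merge: running sum
    lambda acc: round(acc / len(celsius) * 9 / 5 + 32, 1),  # finish: mean, then C to F
)

print(fahrenheit_average)  # 142.2
```

Leaving out the last argument gives the three-argument “reduce” I was originally looking for, which would just return the sum.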