The PySpark API provides many aggregate functions, but not a median. Spark 2 comes with approxQuantile, which gives approximate quantiles, but an exact median is very expensive to calculate. Is there a more PySpark-like way of calculating the median for a column of values in a Spark DataFrame?
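For reference, a sketch of the approxQuantile approach (assuming a DataFrame df with a numeric "salary" column; the third argument is the relative error, where 0 means exact but expensive):
# returns a one-element list containing the approximate median
median = df.approxQuantile("salary", [0.5], 0.25)[0]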
2 Answers
Here is an example implementation with the DataFrame API in Python (Spark 1.6+).
import pyspark.sql.functions as F
import numpy as np
from pyspark.sql.types import FloatType
Let's assume we have monthly salaries for customers in a "salaries" Spark DataFrame such as:
month | customer_id | salary
and we would like to find the median salary per customer across all the months.
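A hypothetical toy DataFrame to follow along with (assuming a Spark 2.x SparkSession named spark; in 1.6 this would be sqlContext, and the values are made up):
salaries = spark.createDataFrame(
    [("2019-01", 1, 1000.0),
     ("2019-02", 1, 1200.0),
     ("2019-01", 2, 800.0)],
    ["month", "customer_id", "salary"])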
Step 1: Write a user-defined function to calculate the median
def find_median(values_list):
    try:
        median = np.median(values_list)  # get the median of the values collected for each row
        return round(float(median), 2)
    except Exception:
        return None  # if there is anything wrong with the given values

median_finder = F.udf(find_median, FloatType())
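The plain Python function can be sanity-checked locally before using it as a UDF (a quick sketch with made-up numbers):
find_median([1000.0, 1200.0, 1100.0])  # returns 1100.0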
Step 2: Aggregate on the salary column by collecting the salaries into a list per customer:
salaries_list = salaries.groupBy("customer_id").agg(F.collect_list("salary").alias("salaries"))
Step 3: Call the median_finder UDF on the salaries column and add the median values as a new column:
salaries_list = salaries_list.withColumn("median", median_finder("salaries"))
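With the toy data above, the result can be inspected as usual; each customer_id gets one rounded median (or null if the collected list could not be processed):
salaries_list.show()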

- using np.nanmedian(values_list) ignores NaNs and is sometimes a better choice – ad_s Jun 28 '17 at 10:58
- Thanks for the answer. What's the performance of this `np.median(values_list)` compared to the `approxQuantile` method provided by Spark? – jack Mar 07 '21 at 15:57
For an exact median (on small to mid-sized DataFrames), since Spark 2.1 one can use the percentile function wrapped in expr:
F.expr('percentile(c2, 0.5)')
df = spark.createDataFrame(
    [(1, 10),
     (1, 20),
     (2, 50)],
    ['c1', 'c2'])
df.groupby('c1').agg(F.expr('percentile(c2, 0.5)').alias('median')).show()
# +---+------+
# | c1|median|
# +---+------+
# | 1| 15.0|
# | 2| 50.0|
# +---+------+
from pyspark.sql import Window as W

df.withColumn('median', F.expr('percentile(c2, 0.5)').over(W.partitionBy('c1'))).show()
# +---+---+------+
# | c1| c2|median|
# +---+---+------+
# | 1| 10| 15.0|
# | 1| 20| 15.0|
# | 2| 50| 50.0|
# +---+---+------+
An approximate median can often be a better choice for mid to large-sized DataFrames. Spark 2.1 implements approx_percentile and percentile_approx:
F.expr('percentile_approx(c2, 0.5)')
Since Spark 3.1 one can use it in the PySpark API directly:
F.percentile_approx('c2', 0.5)
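A minimal sketch using the DataFrame df from above (assuming Spark 3.1+; the accuracy argument is optional and trades precision for memory):
df.groupby('c1').agg(
    F.percentile_approx('c2', 0.5, accuracy=1000000).alias('median')
).show()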
