Scala spark find median in a window partition

Question

I have a dataframe like this:

df = 
--------------
|col1 | col2 |
--------------
| A   | 1    |
| A   | 5    |
| B   | 0    |
| A   | 2    |
| B   | 6    |
| B   | 8    |
--------------

I want to partition by col1, find the median of col2 in each partition, and append the result to form a new column. The result should look like this:

result = 
---------------------
|col1 | col2 | col3 |
---------------------
| A   | 1    | 2    |
| A   | 5    | 2    |
| B   | 0    | 6    |
| A   | 2    | 2    |
| B   | 6    | 6    |
| B   | 8    | 6    |
---------------------

For now, I'm using this code:

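import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, percent_rank}

// Keep the lower half of each partition, then take the largest remaining value
val df2 = df
  .withColumn("tmp", percent_rank().over(Window.partitionBy("col1").orderBy("col2")))
  .where("tmp <= 0.5")
  .groupBy("col1").agg(max("col2") as "col3")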

val result = df.join(df2, df("col1") === df2("col1")).drop(df2("col1"))

But this takes too much time and memory when the dataframe is big. Is there a more efficient way to do this? Any help is much appreciated!

score 1 · Accepted Answer · answered Jan 03 '17 at 14:31

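With the data you have, you can compute the median of each group with percentile_approx in a group by, then join the result back to the original DataFrame. Note that percentile_approx returns an approximate percentile rather than an exact one.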

// Creating the `df` dataset
val df = Seq(("A", 1), ("A", 5), ("B", 0), ("A", 2), ("B", 6), ("B", 8)).toDF("col1", "col2")
df.createOrReplaceTempView("df")

Use percentile_approx with group by to calculate the median of each group:

val df2 = spark.sql("select col1, percentile_approx(col2, 0.5) as median from df group by col1 order by col1")
df2.show()

with the output of df2 being:

+----+------+
|col1|median|
+----+------+
|   A|   2.0|
|   B|   6.0|
+----+------+
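
For reference, the same aggregation can probably be written with the DataFrame API instead of a SQL string. This is a minimal sketch, assuming a Spark version where percentile_approx is available through expr; the local SparkSession and the name `medians` are only there for illustration (in spark-shell, `spark` already exists).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("median-groupby").getOrCreate()
import spark.implicits._

val df = Seq(("A", 1), ("A", 5), ("B", 0), ("A", 2), ("B", 6), ("B", 8)).toDF("col1", "col2")

// Same aggregation as the SQL query above, expressed through the DataFrame API.
val medians = df
  .groupBy("col1")
  .agg(expr("percentile_approx(col2, 0.5)").as("median"))
  .orderBy("col1")

medians.show()
```

The column type of `median` may differ between Spark versions (double or the input type), so cast it explicitly if downstream code depends on it.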

Now join df2 back to df to attach the median to every row (this form of join keeps both col1 columns):

val result = df.join(df2, df("col1") === df2("col1"))
result.show()

//// output
+----+----+----+------+
|col1|col2|col1|median|
+----+----+----+------+
|   A|   1|   A|   2.0|
|   A|   5|   A|   2.0|
|   B|   0|   B|   6.0|
|   A|   2|   A|   2.0|
|   B|   6|   B|   6.0|
|   B|   8|   B|   6.0|
+----+----+----+------+
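
Since the question asks about a window partition, here is a sketch of a variant that skips the separate aggregate and join by evaluating percentile_approx over a window. It assumes your Spark version accepts percentile_approx as a window aggregate; the local SparkSession and the name `byCol1` are illustrative. Whether this is actually cheaper than the join depends on the data, so it is worth measuring both.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.expr

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("median-window").getOrCreate()
import spark.implicits._

val df = Seq(("A", 1), ("A", 5), ("B", 0), ("A", 2), ("B", 6), ("B", 8)).toDF("col1", "col2")

// One window per col1 value; without orderBy the frame covers the whole partition.
val byCol1 = Window.partitionBy("col1")

// Every row receives the approximate median of its own partition as col3.
val result = df.withColumn("col3", expr("percentile_approx(col2, 0.5)").over(byCol1))

result.show()
```

This also yields a single col1 column and names the new column col3, matching the layout requested in the question.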
