I have a dataframe like this:
df =
---------------
| col1 | col2 |
---------------
| A    | 1    |
| A    | 5    |
| B    | 0    |
| A    | 2    |
| B    | 6    |
| B    | 8    |
---------------
I want to partition by col1, find the median of col2 in each partition, and append the result to form a new column. The result should look like this:
result =
----------------------
| col1 | col2 | col3 |
----------------------
| A    | 1    | 2    |
| A    | 5    | 2    |
| B    | 0    | 6    |
| A    | 2    | 2    |
| B    | 6    | 6    |
| B    | 8    | 6    |
----------------------
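To make the intended semantics concrete, here is a minimal plain-Scala sketch (no Spark involved) of the per-group median I am after. It assumes the "lower median": the middle element of the sorted values, or the smaller of the two middle elements when the group has an even number of rows. The names `MedianSketch`, `lowerMedian`, `withGroupMedian`, and `sample` are hypothetical and exist only for this illustration.

```scala
// Illustration only: plain Scala collections, not Spark.
// All names here are hypothetical helpers for this example.
object MedianSketch {
  // Lower median: middle element of the sorted values; for an even
  // count, the smaller of the two middle elements (an assumption).
  def lowerMedian(values: Seq[Int]): Int = {
    require(values.nonEmpty, "median of an empty group is undefined")
    val sorted = values.sorted
    sorted((sorted.length - 1) / 2)
  }

  // Compute one median per col1 value, then attach it to every row
  // while preserving the original row order.
  def withGroupMedian(rows: Seq[(String, Int)]): Seq[(String, Int, Int)] = {
    val medians: Map[String, Int] =
      rows.groupBy(_._1).map { case (key, group) =>
        key -> lowerMedian(group.map(_._2))
      }
    rows.map { case (key, value) => (key, value, medians(key)) }
  }

  // Same rows as the example dataframe.
  val sample: Seq[(String, Int)] =
    Seq(("A", 1), ("A", 5), ("B", 0), ("A", 2), ("B", 6), ("B", 8))

  def main(args: Array[String]): Unit =
    withGroupMedian(sample).foreach(println)
}
```

For the sample rows, every A row gets 2 and every B row gets 6 as col3.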
For now, I'm using this code:
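import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, percent_rank}

// Keep the lower half of each partition (by percent rank), then take its largest value as the median.
val df2 = df
  .withColumn("tmp", percent_rank().over(Window.partitionBy("col1").orderBy("col2")))
  .where("tmp <= 0.5")
  .groupBy("col1").agg(max("col2") as "col3")
// Join the per-group medians back onto the original rows.
val result = df.join(df2, df("col1") === df2("col1")).drop(df2("col1"))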
But this takes too much time and memory when the dataframe is large. Is there a more efficient way to do this? Any help is much appreciated!