
I have a PySpark dataframe consisting of three columns x, y, z.

A given value of x may appear in multiple rows. How can I compute the percentile of y for each key in x separately? In the example below, x is Role and y is Salary:

+------+---------+------+
|  Name|     Role|Salary|
+------+---------+------+
|   bob|Developer|125000|
|  mark|Developer|108000|
|  carl|   Tester| 70000|
|  carl|Developer|185000|
|  carl|   Tester| 65000|
| roman|   Tester| 82000|
| simon|Developer| 98000|
|  eric|Developer|144000|
|carlos|   Tester| 75000|
| henry|Developer|110000|
+------+---------+------+

The output needed:

+------+---------+------+--------+
|  Name|     Role|Salary|     50%|
+------+---------+------+--------+
|   bob|Developer|125000|117500.0|
|  mark|Developer|108000|117500.0|
|  carl|   Tester| 70000| 72500.0|
|  carl|Developer|185000|117500.0|
|  carl|   Tester| 65000| 72500.0|
| roman|   Tester| 82000| 72500.0|
| simon|Developer| 98000|117500.0|
|  eric|Developer|144000|117500.0|
|carlos|   Tester| 75000| 72500.0|
| henry|Developer|110000|117500.0|
+------+---------+------+--------+
– bib
  • @, I don't know exactly what the output should be; I'm just looking to split each role into several salary ranges. What I mean by split is to take, for example, the developers between [0-25% of developer salaries], then [25-50%], [50-75%] and [75-100%]. I'm doing this because I have a huge dataframe and I need to know (filter), for example, the developers with a high salary, etc. – bib Aug 09 '19 at 14:08
  • You need to take the time to figure out what you want the output to be, and describe it in detail. What you're saying in your most recent comment and what you're asking in the question appear to be two different things. As it stands, this question may be closed as "Too Broad" or "Unclear what you're asking". – pault Aug 09 '19 at 14:11
  • You may want to look into [Difference between QuantileDiscretizer and Bucketizer in Spark](https://stackoverflow.com/questions/43386188/difference-between-quantilediscretizer-and-bucketizer-in-spark) – pault Aug 09 '19 at 14:13
  • @pault thanks for the suggested links. Maybe I was not clear – bib Aug 09 '19 at 14:26
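Based on the clarification in the comments above (splitting each Role's salaries into quartile ranges so that, for example, the highest-paid developers can be filtered), here is a minimal sketch using ntile over a window partitioned by Role. It assumes df is the dataframe shown above; the column name salary_quartile is only an illustrative choice:

import pyspark.sql.functions as F
from pyspark.sql import Window

# 1 = lowest salary quartile within each Role, 4 = highest
w = Window.partitionBy('Role').orderBy('Salary')
df_buckets = df.withColumn('salary_quartile', F.ntile(4).over(w))

# e.g. keep only the top-quartile Developers
df_buckets.filter((F.col('Role') == 'Developer') & (F.col('salary_quartile') == 4)).show()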

3 Answers


Try groupby + F.expr:

import pyspark.sql.functions as F

df1 = df.groupby('Role').agg(F.expr('percentile(Salary, array(0.25))')[0].alias('%25'),
                             F.expr('percentile(Salary, array(0.50))')[0].alias('%50'),
                             F.expr('percentile(Salary, array(0.75))')[0].alias('%75'))
df1.show()

Output:

+---------+--------+--------+--------+
|     Role|     %25|     %50|     %75|
+---------+--------+--------+--------+
|   Tester| 68750.0| 72500.0| 76750.0|
|Developer|108500.0|117500.0|139250.0|
+---------+--------+--------+--------+

Now you can join df1 with the original dataframe:

df.join(df1, on='Role', how='left').show()

Output:

+---------+------+------+--------+--------+--------+
|     Role|  Name|Salary|     %25|     %50|     %75|
+---------+------+------+--------+--------+--------+
|   Tester|  carl| 70000| 68750.0| 72500.0| 76750.0|
|   Tester|  carl| 65000| 68750.0| 72500.0| 76750.0|
|   Tester| roman| 82000| 68750.0| 72500.0| 76750.0|
|   Tester|carlos| 75000| 68750.0| 72500.0| 76750.0|
|Developer|   bob|125000|108500.0|117500.0|139250.0|
|Developer|  mark|108000|108500.0|117500.0|139250.0|
|Developer|  carl|185000|108500.0|117500.0|139250.0|
|Developer| simon| 98000|108500.0|117500.0|139250.0|
|Developer|  eric|144000|108500.0|117500.0|139250.0|
|Developer| henry|110000|108500.0|117500.0|139250.0|
+---------+------+------+--------+--------+--------+
– Ala Tarighati
  • Thanks for your answer, it was what I'm looking for with a small difference: how can I get the values attached directly to the original dataframe? Please look at the updated post – bib Aug 09 '19 at 14:42
  • One option is to `join` the resulting dataframe with the original one. Another option might be using a `window` function. Check the updated answer please :) – Ala Tarighati Aug 12 '19 at 06:25

array is not really necessary:

F.expr('percentile(Salary, 0.5)')

Together with a window function it does the job:

import pyspark.sql.functions as F
from pyspark.sql import Window as W

df = df.withColumn('50%', F.expr('percentile(Salary, 0.5)').over(W.partitionBy('Role')))

df.show()
#  +------+---------+------+--------+
#  |  Name|     Role|Salary|     50%|
#  +------+---------+------+--------+
#  |   bob|Developer|125000|117500.0|
#  |  mark|Developer|108000|117500.0|
#  |  carl|Developer|185000|117500.0|
#  | simon|Developer| 98000|117500.0|
#  |  eric|Developer|144000|117500.0|
#  | henry|Developer|110000|117500.0|
#  |  carl|   Tester| 70000| 72500.0|
#  |  carl|   Tester| 65000| 72500.0|
#  | roman|   Tester| 82000| 72500.0|
#  |carlos|   Tester| 75000| 72500.0|
#  +------+---------+------+--------+
– ZygD

You can try the `approxQuantile` function available in Spark:

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.approxQuantile.html
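
Note that `approxQuantile` is a DataFrame method, so it computes quantiles over the whole dataframe rather than per group; to get per-Role values you would filter each role first. A minimal sketch, assuming df is the dataframe shown in the question (the 0.01 relative error is an arbitrary choice):

import pyspark.sql.functions as F

# approxQuantile(column, probabilities, relativeError) returns a list of floats
dev_quartiles = df.filter(F.col('Role') == 'Developer') \
                  .approxQuantile('Salary', [0.25, 0.5, 0.75], 0.01)
print(dev_quartiles)  # [25th, 50th, 75th percentile of Developer salaries]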

– Neeraj Bhadani