
I have a PySpark dataframe consisting of three columns x, y, z.

A given value of x may appear in multiple rows. How can I compute the percentile of y for each key in x separately? In the example below, x is Role and y is Salary:

+------+---------+------+
|  Name|     Role|Salary|
+------+---------+------+
|   bob|Developer|125000|
|  mark|Developer|108000|
|  carl|   Tester| 70000|
|  carl|Developer|185000|
|  carl|   Tester| 65000|
| roman|   Tester| 82000|
| simon|Developer| 98000|
|  eric|Developer|144000|
|carlos|   Tester| 75000|
| henry|Developer|110000|
+------+---------+------+

The output needed:

+------+---------+------+--------+
|  Name|     Role|Salary|     50%|
+------+---------+------+--------+
|   bob|Developer|125000|117500.0|
|  mark|Developer|108000|117500.0|
|  carl|   Tester| 70000| 72500.0|
|  carl|Developer|185000|117500.0|
|  carl|   Tester| 65000| 72500.0|
| roman|   Tester| 82000| 72500.0|
| simon|Developer| 98000|117500.0|
|  eric|Developer|144000|117500.0|
|carlos|   Tester| 75000| 72500.0|
| henry|Developer|110000|117500.0|
+------+---------+------+--------+
– bib
  • @, I don't know exactly what the output should be; I'm just looking to split each role into several salary ranges. What I mean by split is to take, for example, the developers between [0-25% of developer salaries], then [25-50%], [50-75%] and [75-100%]. I'm doing this because I have a huge dataframe and I need to know (filter), for example, the developers with a high salary, etc. – bib Aug 09 '19 at 14:08
  • You need to take the time to figure out what you want the output to be, and describe it in detail. What you're saying in your most recent comment and what you're asking in the question appear to be two different things. As it stands, this question may be closed as "Too Broad" or "Unclear what you're asking". – pault Aug 09 '19 at 14:11
  • You may want to look into [Difference between QuantileDiscretizer and Bucketizer in Spark](https://stackoverflow.com/questions/43386188/difference-between-quantilediscretizer-and-bucketizer-in-spark) – pault Aug 09 '19 at 14:13
  • @pault thanks for the suggested links. Maybe I was not clear – bib Aug 09 '19 at 14:26
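Based on the clarification in the comments above (splitting each Role's salaries into quartile ranges so that, for example, the highest-paid developers can be filtered), here is a minimal sketch using ntile over a window partitioned by Role. It assumes df is the dataframe shown above; the column name salary_quartile is only an illustrative choice:

import pyspark.sql.functions as F
from pyspark.sql import Window

# 1 = lowest salary quartile within each Role, 4 = highest
w = Window.partitionBy('Role').orderBy('Salary')
df_buckets = df.withColumn('salary_quartile', F.ntile(4).over(w))

# e.g. keep only the top-quartile Developers
df_buckets.filter((F.col('Role') == 'Developer') & (F.col('salary_quartile') == 4)).show()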

3 Answers


Try groupby + F.expr:

import pyspark.sql.functions as F

df1 = df.groupby('Role').agg(F.expr('percentile(Salary, array(0.25))')[0].alias('%25'),
                             F.expr('percentile(Salary, array(0.50))')[0].alias('%50'),
                             F.expr('percentile(Salary, array(0.75))')[0].alias('%75'))
df1.show()

Output:

+---------+--------+--------+--------+
|     Role|     %25|     %50|     %75|
+---------+--------+--------+--------+
|   Tester| 68750.0| 72500.0| 76750.0|
|Developer|108500.0|117500.0|139250.0|
+---------+--------+--------+--------+

Now you can join df1 with the original dataframe:

df.join(df1, on='Role', how='left').show()

Output:

+---------+------+------+--------+--------+--------+
|     Role|  Name|Salary|     %25|     %50|     %75|
+---------+------+------+--------+--------+--------+
|   Tester|  carl| 70000| 68750.0| 72500.0| 76750.0|
|   Tester|  carl| 65000| 68750.0| 72500.0| 76750.0|
|   Tester| roman| 82000| 68750.0| 72500.0| 76750.0|
|   Tester|carlos| 75000| 68750.0| 72500.0| 76750.0|
|Developer|   bob|125000|108500.0|117500.0|139250.0|
|Developer|  mark|108000|108500.0|117500.0|139250.0|
|Developer|  carl|185000|108500.0|117500.0|139250.0|
|Developer| simon| 98000|108500.0|117500.0|139250.0|
|Developer|  eric|144000|108500.0|117500.0|139250.0|
|Developer| henry|110000|108500.0|117500.0|139250.0|
+---------+------+------+--------+--------+--------+
– Ala Tarighati
  • Thanks for your answer, it was what I'm looking for with a small difference: how can I get the values attached directly to the original dataframe? Please look at the updated post – bib Aug 09 '19 at 14:42
  • One option is to `join` the resulting dataframe with the original one. Another option might be using a `window` function. Check the updated answer please :) – Ala Tarighati Aug 12 '19 at 06:25

array is not really necessary:

F.expr('percentile(Salary, 0.5)')

Together with a window function it does the job:

import pyspark.sql.functions as F
from pyspark.sql import Window as W

df = df.withColumn('50%', F.expr('percentile(Salary, 0.5)').over(W.partitionBy('Role')))

df.show()
#  +------+---------+------+--------+
#  |  Name|     Role|Salary|     50%|
#  +------+---------+------+--------+
#  |   bob|Developer|125000|117500.0|
#  |  mark|Developer|108000|117500.0|
#  |  carl|Developer|185000|117500.0|
#  | simon|Developer| 98000|117500.0|
#  |  eric|Developer|144000|117500.0|
#  | henry|Developer|110000|117500.0|
#  |  carl|   Tester| 70000| 72500.0|
#  |  carl|   Tester| 65000| 72500.0|
#  | roman|   Tester| 82000| 72500.0|
#  |carlos|   Tester| 75000| 72500.0|
#  +------+---------+------+--------+
– ZygD

You can try the `approxQuantile` function available in Spark:

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.approxQuantile.html
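
Note that `approxQuantile` is a DataFrame method, so it computes quantiles over the whole dataframe rather than per group; to get per-Role values you would filter each role first. A minimal sketch, assuming df is the dataframe shown in the question (the 0.01 relative error is an arbitrary choice):

import pyspark.sql.functions as F

# approxQuantile(column, probabilities, relativeError) returns a list of floats
dev_quartiles = df.filter(F.col('Role') == 'Developer') \
                  .approxQuantile('Salary', [0.25, 0.5, 0.75], 0.01)
print(dev_quartiles)  # [25th, 50th, 75th percentile of Developer salaries]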

– Neeraj Bhadani