
I want to convert multiple numeric columns of a PySpark dataframe into their percentile values using PySpark, without changing the row order.

E.g., given an array of column names arr = ['Salary', 'Age', 'Bonus'], convert those columns into their percentiles.

Input

+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept        | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1        | HR          | 111     | 1000   | 45  | 100   |
| 2        | Sales       | 596     | 500    | 30  | 50    |
| 3        | Manufacture | 895     | 600    | 50  | 400   |
| 4        | HR          | 212     | 700    | 26  | 60    |
| 5        | Business    | 754     | 350    | 18  | 22    |
+----------+-------------+---------+--------+-----+-------+

Expected output

+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept        | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1        | HR          | 111     | 100    | 80  | 80    |
| 2        | Sales       | 596     | 40     | 60  | 40    |
| 3        | Manufacture | 895     | 60     | 100 | 100   |
| 4        | HR          | 212     | 80     | 40  | 60    |
| 5        | Business    | 754     | 20     | 20  | 20    |
+----------+-------------+---------+--------+-----+-------+

The formula for the percentile of a given element 'x' in a column = (number of elements less than or equal to 'x' / total number of elements) * 100.
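
A minimal plain-Python sketch of that formula, using the Salary column, which reproduces the expected output above:

# Plain-Python check of the percentile formula on the Salary values
salaries = [1000, 500, 600, 700, 350]
percentiles = [sum(s <= x for s in salaries) / len(salaries) * 100 for x in salaries]
print(percentiles)  # [100.0, 40.0, 60.0, 80.0, 20.0]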

2 Answers


You can use percentile_approx for this, in conjunction with a groupBy on the key columns, aggregating the columns for which you want the percentile calculated.

Built in since Spark 3.x:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

input_list = [
    (1, "HR", 111, 1000, 45, 100),
    (2, "Sales", 112, 500, 30, 50),
    (3, "Manufacture", 127, 600, 50, 500),
    (4, "Hr", 821, 700, 26, 60),
    (5, "Business", 754, 350, 18, 22),
]

sparkDF = spark.createDataFrame(input_list, ['emp_no', 'dept', 'pincode', 'salary', 'age', 'bonus'])

# Keep the first pincode per group and compute the 0.95 quantile of each numeric column
sparkDF.groupBy('emp_no', 'dept').agg(
    F.first('pincode').alias('pincode'),
    *[F.percentile_approx(F.col(c), 0.95).alias(c) for c in ['salary', 'age', 'bonus']]
).show()

+------+-----------+-------+------+---+-----+
|emp_no|       dept|pincode|salary|age|bonus|
+------+-----------+-------+------+---+-----+
|     3|Manufacture|    127|   600| 50|  500|
|     1|         HR|    111|  1000| 45|  100|
|     2|      Sales|    112|   500| 30|   50|
|     5|   Business|    754|   350| 18|   22|
|     4|         Hr|    821|   700| 26|   60|
+------+-----------+-------+------+---+-----+
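
Note that the second argument (0.95 here) is the quantile to estimate, and since each (emp_no, dept) group contains only one row, the aggregation returns the original values unchanged. Aggregated over the whole dataframe instead, percentile_approx collapses each column into a single approximate quantile, roughly as in this minimal sketch:

# Sketch: the same aggregation over the whole dataframe instead of per employee
sparkDF.agg(
    *[F.percentile_approx(F.col(c), 0.95).alias(c) for c in ['salary', 'age', 'bonus']]
).show()
# With only five rows, the 0.95 quantile estimate is simply the largest value in each column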

Vaebhav
  • Thanks for the input Vaebhav, I would like to have the percentiles in the output. – Coder_noobie Jul 28 '21 at 15:46
  • `percentile_approx` is providing you that. The output that you see is a percentile output. – Vaebhav Jul 28 '21 at 17:02
  • I see the output shown here has the original values and not the percentiles? Even the order doesn't matter for me, but I need to get the output shown in my post, i.e. values calculated as percentiles for the given columns. Also, I am not sure what 0.95 does in your solution. – Coder_noobie Jul 29 '21 at 00:37

Spark has a window function for calculating percentiles, called percent_rank.

Test df:

from pyspark.sql import SparkSession, functions as F, Window as W
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "HR", 111, 1000, 45, 100),
     (2, "Sales", 596, 500, 30, 50),
     (3, "Manufacture", 895, 600, 50, 400),
     (4, "HR", 212, 700, 26, 60),
     (5, "Business", 754, 350, 18, 22)],
    ['Empl_No', 'Dept', 'Pincode', 'Salary', 'Age', 'Bonus'])
df.show()
#  +-------+-----------+-------+------+---+-----+
#  |Empl_No|       Dept|Pincode|Salary|Age|Bonus|
#  +-------+-----------+-------+------+---+-----+
#  |      1|         HR|    111|  1000| 45|  100|
#  |      2|      Sales|    596|   500| 30|   50|
#  |      3|Manufacture|    895|   600| 50|  400|
#  |      4|         HR|    212|   700| 26|   60|
#  |      5|   Business|    754|   350| 18|   22|
#  +-------+-----------+-------+------+---+-----+

percent_rank works in such a way that the smallest value gets percentile 0 and the biggest value gets 1; it is computed as (rank - 1) / (number of rows - 1).

arr = ['Salary', 'Age', 'Bonus']
df = df.select(
    *[c for c in df.columns if c not in arr],
    *[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')

df.show()
#  +-------+-----------+-------+------+----+-----+
#  |Empl_No|       Dept|Pincode|Salary| Age|Bonus|
#  +-------+-----------+-------+------+----+-----+
#  |      1|         HR|    111|   1.0|0.75| 0.75|
#  |      2|      Sales|    596|  0.25| 0.5| 0.25|
#  |      3|Manufacture|    895|   0.5| 1.0|  1.0|
#  |      4|         HR|    212|  0.75|0.25|  0.5|
#  |      5|   Business|    754|   0.0| 0.0|  0.0|
#  +-------+-----------+-------+------+----+-----+

However, your expectation is somewhat different. You expect it to assume 0 as the smallest value even though it does not exist in the columns.

To solve this, I will add a row with 0 values and delete it later.

arr = ['Salary', 'Age', 'Bonus']

# Adding a row containing 0 values
df = df.limit(1).withColumn('Dept', F.lit('_tmp')).select(
    *[c for c in df.columns if c not in arr],
    *[F.lit(0).alias(c) for c in arr]
).union(df)

# Calculating percentiles
df = df.select(
    *[c for c in df.columns if c not in arr],
    *[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')

# Removing the fake row
df = df.filter("Dept != '_tmp'")

df.show()
#  +-------+-----------+-------+------+---+-----+
#  |Empl_No|       Dept|Pincode|Salary|Age|Bonus|
#  +-------+-----------+-------+------+---+-----+
#  |      1|         HR|    111|   1.0|0.8|  0.8|
#  |      2|      Sales|    596|   0.4|0.6|  0.4|
#  |      3|Manufacture|    895|   0.6|1.0|  1.0|
#  |      4|         HR|    212|   0.8|0.4|  0.6|
#  |      5|   Business|    754|   0.2|0.2|  0.2|
#  +-------+-----------+-------+------+---+-----+

You can multiply the percentile by 100 if you like:

*[(100 * F.percent_rank().over(W.orderBy(c))).alias(c) for c in arr]

Then you get...

+-------+-----------+-------+------+-----+-----+
|Empl_No|       Dept|Pincode|Salary|  Age|Bonus|
+-------+-----------+-------+------+-----+-----+
|      1|         HR|    111| 100.0| 80.0| 80.0|
|      2|      Sales|    596|  40.0| 60.0| 40.0|
|      3|Manufacture|    895|  60.0|100.0|100.0|
|      4|         HR|    212|  80.0| 40.0| 60.0|
|      5|   Business|    754|  20.0| 20.0| 20.0|
+-------+-----------+-------+------+-----+-----+
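
If you prefer whole-number percentiles exactly as in the expected output, a minimal variation is to cast the expression to an integer:

# Same expression, cast to integer so 80.0 becomes 80
*[(100 * F.percent_rank().over(W.orderBy(c))).cast('int').alias(c) for c in arr]
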
ZygD
  • I love the dynamic solution with no hard-coded column names, except (where necessary) for "arr"! – gilgamash Apr 25 '22 at 11:28