How to add a column to a PySpark dataframe which contains the nth quantile of another column in the dataframe

Question

I have a very large CSV file which has been imported as a PySpark dataframe: df. The dataframe contains many columns including column ireturn. I want to compute the 0.99 and 0.01 percentile of this column and then add another column to the dataframe df as new_col_99 and new_col_01 which contains the 0.99 and 0.01 percentiles, respectively. I wrote the following code which works for small dataframes, but I get errors when I apply it to my large dataframe.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of the file", inferSchema = True, header = True)

precentile_99 = df.selectExpr('percentile(val1, 0.99)').head(1)[0][0]
precentile_01 = df.selectExpr('percentile(val1, 0.01)').head(1)[0][0]
from pyspark.sql.functions import lit
df = df.withColumn("new_col_99", lit(precentile_99))
df = df.withColumn("new_col_01", lit(precentile_01))

I tried to replace head with collect, but it did not work either. I get this error:

Logging error ---
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:49850)
Traceback (most recent call last):...

I have also tried the following:

percentile = df.approxQuantile('ireturn',[0.01,0.99],0.25)
df = df.withColumn("new_col_01", lit(percentile[0]))
df = df.withColumn("new_col_99", lit(percentile[1]))

The above code takes about 15-20 min to run but the result is wrong (my data on the column ireturn are less than 1 but it returns the 0.99 percentile as 6789....)

What is exactly the error that you are getting? You might want to try `percentile_approx` if the cause is lack of memory. — martinarroyo, Jan 16 '19 at 08:19
I updated my answer. Could you please use **percentile_approx** in an example. I am not sure what is its format. — Monirrad, Jan 19 '19 at 19:47

score 0 · Answer 1 · answered Apr 05 '22 at 12:09

Late, but hopefully answers your concerns. You can get the result this way:

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of the file", inferSchema = True, header = True)

df = df.withColumn("new_col_99", F.expr('percentile(val1, 0.99) over()'))
df = df.withColumn("new_col_01", F.expr('percentile(val1, 0.01) over()'))

For large datasets percentile_approx may be better:

df = df.withColumn("new_col_99", F.expr('percentile_approx(val1, 0.99) over()'))
df = df.withColumn("new_col_01", F.expr('percentile_approx(val1, 0.01) over()'))

How to add a column to a PySpark dataframe which contains the nth quantile of another column in the dataframe

1 Answers1

Linked