
I have used `agg` to get the average value of a column in my DataFrame, like this:

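from pyspark.sql.functions import avg, count

# Parentheses let the method chain span several lines
(df.groupBy('day', 'city')
   .agg(count("*"),
        avg(df.price).alias("avgPrice"))
)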

From the question "Calculate percentile on pyspark dataframe columns", I learned that I can use df.selectExpr('percentile(MOU_G_EDUCATION_ADULT, 0.95)') to get the 95th percentile of a column. How can I add that inside the agg() function?

n179911a

1 Answer


You can use the expr function to pass a SQL expression such as percentile(...) to agg.

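from pyspark.sql.functions import avg, count, expr

(df.groupBy('city')
 .agg(count("*"),
      avg(df.price).alias("avgPrice"),
      # expr() parses a SQL expression string into a Column
      expr("percentile(price, 0.95)").alias("percentile"))
)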

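However, as the linked question suggests, if your dataset is large and you do not mind some approximation, consider using percentile_approx. It is available as a Python function in pyspark.sql.functions since Spark 3.1; on older versions you can write expr("percentile_approx(price, 0.95)") instead.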

from pyspark.sql.functions import avg, count, percentile_approx

(df.groupBy('city')
 .agg(count("*"),
      avg(df.price).alias("avgPrice"),
      percentile_approx('price', 0.95).alias('percentile'))
)
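
If it helps to see what the grouped aggregation computes, here is a plain-Python sketch that needs no Spark. It assumes the exact percentile is found by linear interpolation between the two nearest ranks, which is my understanding of how Spark's percentile behaves; the sample rows and the helper name exact_percentile are made up for illustration and are not part of any Spark API.

```python
from collections import defaultdict
from statistics import mean


def exact_percentile(values, p):
    """Exact percentile of a non-empty list, assuming linear
    interpolation between the two nearest ranks (0 <= p <= 1)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p          # fractional rank
    lo = int(pos)                         # rank just below (pos >= 0)
    hi = min(lo + 1, len(ordered) - 1)    # rank just above, clamped
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


# Hypothetical sample rows: (city, price)
rows = [("A", 10.0), ("A", 20.0), ("A", 30.0), ("A", 40.0), ("A", 50.0),
        ("B", 100.0), ("B", 200.0)]

# Equivalent of groupBy('city'): collect the prices per city
groups = defaultdict(list)
for city, price in rows:
    groups[city].append(price)

# Equivalent of agg(count, avg, percentile) applied to each group
summary = {
    city: {
        "count": len(prices),
        "avgPrice": mean(prices),
        "percentile": exact_percentile(prices, 0.95),
    }
    for city, prices in groups.items()
}

for city, stats in summary.items():
    print(city, stats)
```

The exact version has to sort every value in a group, which is why the approximate variant is usually the better choice on large datasets.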
Emma