How to calculate the mean of a dataframe column and find the top 10%

Question

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select groups of players whose stats meet certain criteria.

Once I have the subset of players I am interested in looking at further, I would like to find the mean of a column, e.g. Batting Average or RBIs. From there I would like to break all the players into percentile groups based on their average performance compared to all players: the top 10%, the bottom 10%, 40-50%, and so on.

I've been able to use the DataFrame.describe() function to return a summary of a desired column (mean, stddev, count, min, and max), but all of the values come back as strings. Is there a better way to get just the mean and stddev as Doubles, and what is the best way of breaking the players into groups of 10-percentiles?

So far my thoughts are to find the values that bookend the percentile ranges and write a function that groups players via comparators, but that feels like it is bordering on reinventing the wheel.

Looks like DataFrame has some percentile stuff built in: http://stackoverflow.com/a/30900466/21755 Any use? — The Archetypal Paul, Jul 22 '15 at 14:20
I had tried that previously, but I get the following error: `Exception in thread "main" java.util.NoSuchElementException: key not found: PERCENTILE` — the3rdNotch, Jul 22 '15 at 14:48

score 0 · Answer 1 · answered Sep 28 '15 at 15:58

I was able to get the percentiles by using window functions and applying ntile() and cumeDist() over the window. ntile(n) splits the ordered rows into n groups of roughly equal size. If you want things grouped by 10%, use ntile(10); if by 5%, use ntile(20). For a more fine-tuned result, cumeDist() applied over the window will output a new column with the cumulative distribution, and the rows can be filtered from there through select(), where(), or a SQL query.
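
To make this concrete, here is a minimal sketch of the approach, not the exact code I ran. It assumes Spark 2.x or later (SparkSession, and the snake_case name cume_dist(), which replaced cumeDist() in newer releases as far as I know). The object name, the helper names, and the output column names "decile" and "cume_dist" are made up for illustration. It also shows one way to get the mean and stddev back as Doubles through agg() rather than describe().

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, cume_dist, ntile, stddev}

// Hypothetical helper object; names are illustrative, not from any library.
object PercentileSketch {

  // Returns (mean, sample stddev) of a numeric column as Doubles.
  // Assumes the column is non-empty and has at least two non-null values;
  // otherwise the aggregates are null and getDouble would fail.
  def meanAndStddev(df: DataFrame, statCol: String): (Double, Double) = {
    val row = df.agg(avg(col(statCol)), stddev(col(statCol))).head()
    (row.getDouble(0), row.getDouble(1))
  }

  // Adds two columns computed over a window ordered from best to worst:
  //   decile    - ntile(10) bucket, where 1 holds the highest values
  //   cume_dist - fraction of rows ranked at or above the current row
  def withPercentiles(df: DataFrame, statCol: String): DataFrame = {
    // No partitionBy: every row is ranked against all other rows.
    val w = Window.orderBy(col(statCol).desc)
    df.withColumn("decile", ntile(10).over(w))
      .withColumn("cume_dist", cume_dist().over(w))
  }
}
```

With a DataFrame called players and a column called battingAvg (both placeholder names), the top 10% would be something like `PercentileSketch.withPercentiles(players, "battingAvg").where(col("decile") === 1)`, or `.where(col("cume_dist") <= 0.1)` for a cutoff that keeps ties together. Because the window has no partitionBy, Spark pulls all rows into a single partition to rank them, which is fine for a data set of this size but worth keeping in mind for larger ones.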
