Grouped percentile using SparkR

Question

I want to calculate grouped percentiles using SparkR. I tried this:

library(SparkR)
mtcars_spark %>% 
      SparkR::groupBy("cyl") %>%
      SparkR::summarize(p75 = approxQuantile("mpg", 0.75, 0.01),
                        p90 = approxQuantile("mpg", 0.90, 0.01),
                        p99 = approxQuantile("mpg", 0.99, 0.01))

...but got this error:

unable to find an inherited method for function ‘approxQuantile’ for signature ‘"GroupedData", "character", "numeric", "numeric"’

How can I get the grouped percentiles using SparkR so that the desired output is the same as from the following code:

library(dplyr)
mtcars %>% 
  group_by(cyl) %>% 
  summarise(p75 = quantile(mpg, 0.75),
            p90 = quantile(mpg, 0.90),
            p99 = quantile(mpg, 0.99))

score 1 · Accepted Answer · answered Aug 28 '18 at 21:58

approxQuantile is a method that operates on Datasets (a SparkDataFrame in SparkR); it has no variant that works on *GroupedDataset (GroupedData in SparkR, as the error message shows). If you've enabled Hive support, you can use Hive's percentile UDF:

mtcars_spark %>% 
    SparkR::groupBy("cyl") %>%
    SparkR::summarize(p75 = expr("percentile(mpg, 0.75)"),
                      p90 = expr("percentile(mpg, 0.90)"),
                      p99 = expr("percentile(mpg, 0.99)"))

If not, you could try the gapply function, but it is likely to be much less efficient.
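
A minimal sketch of the gapply route, assuming an active Spark session and that mtcars_spark holds the mtcars columns with cyl stored as a double. The names percentiles_for_group and grouped_percentiles are hypothetical helpers introduced only for illustration. gapply hands each group to an ordinary R function as a local data.frame, so base R's quantile can be used directly:

```r
# Per-group function: gapply calls it once per grouping key with a local
# R data.frame holding that group's rows.
# (hypothetical helper name, not part of SparkR)
percentiles_for_group <- function(key, df) {
  q <- stats::quantile(df$mpg, probs = c(0.75, 0.90, 0.99), names = FALSE)
  data.frame(cyl = key[[1]], p75 = q[1], p90 = q[2], p99 = q[3])
}

# Wrapper around gapply (hypothetical helper name).
# Assumes `sdf` is a SparkDataFrame with `cyl` and `mpg` columns and that
# a Spark session is already running.
grouped_percentiles <- function(sdf) {
  # Schema describing the data.frame returned by percentiles_for_group.
  schema <- SparkR::structType(
    SparkR::structField("cyl", "double"),
    SparkR::structField("p75", "double"),
    SparkR::structField("p90", "double"),
    SparkR::structField("p99", "double")
  )
  SparkR::gapply(sdf, "cyl", percentiles_for_group, schema)
}
```

With a session running, grouped_percentiles(mtcars_spark) should return a SparkDataFrame with one row per cyl value, which can be pulled back with SparkR::collect(). Each group is serialized to an R worker process, which is why this is likely to be slower than the expr("percentile(...)") version.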

zero323

Do I need to load any library to use Hive's percentile function? – Geet Aug 28 '18 at 22:14
wow...That worked! Thanks!! Where can I read more about this? – Geet Aug 28 '18 at 22:24
[Hive Language Manual - UDF section](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF). You can also check https://stackoverflow.com/q/52049152/6910411 and https://stackoverflow.com/q/34519549/6910411 – zero323 Aug 28 '18 at 23:12
