I'm exploring SparkR to compute quantiles of a numeric column in a CSV file stored in S3. I can parse the CSV file, print its rows, and access the column, but I'm not sure how to compute quantiles on it. Any help would be appreciated.

PS: R has a built-in function for computing quantiles, but it works on native R data frames, not on SparkR DataFrames.
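For reference, here is a minimal sketch of the built-in function mentioned above, applied to the mtcars data set that ships with R. It runs on a local data frame only, so it assumes the data fits in the memory of a single R process; the choice of probabilities is just for illustration.

```r
# Base R only: quantile() works on an ordinary numeric vector or
# a column of a local data.frame, not on a SparkR DataFrame.
probs <- c(0.25, 0.5, 0.75)
q <- quantile(mtcars$mpg, probs = probs)
print(q)
#    25%    50%    75%
# 15.425 19.200 22.800
```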

devsathish
    Quantiles on big data sets aren't the best idea. The easiest fix, however, is probably to use Spark SQL, if medians are already implemented there. – Wannes Rosiers Jul 20 '15 at 10:58

2 Answers

If you are open to a Spark + R answer not using SparkR, you can use dplyr with the dplyr.spark.hive backend.

mtcars_db  %>% mutate(q = quantile(mpg, .3))
Source: Spark at:localhost:10000
From: <derived table> [?? x 12]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb     q
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4 15.68
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4 15.68
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1 15.68
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1 15.68
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2 15.68
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1 15.68
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4 15.68
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2 15.68
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2 15.68
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4 15.68
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...

Here, mtcars_db is a tbl backed by a Spark SQL table.

piccolbo

Spark 2.x comes with an approxQuantile function. If you set relativeError to zero, it returns the exact quantiles, although this can be very expensive on large data sets.

> sdf <- SparkR::createDataFrame(mtcars)
> quantiles <- approxQuantile(sdf, "mpg", c(0.5, 0.8), relativeError = 0.0)
> quantiles
[[1]]
[1] 19.2

[[2]]
[1] 26

More details here: http://spark.apache.org/docs/latest/api/R/approxQuantile.html
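To connect this to the original question (a CSV file rather than mtcars), here is a rough sketch that wraps the same call in a helper. It assumes Spark 2.x with the SparkR package installed and a cluster already configured to read the given path. The helper name csv_quantiles, its argument names, and the default relative error of 0.01 are hypothetical choices, not part of SparkR.

```r
# Sketch only: assumes Spark 2.x with the SparkR package installed.
# `csv_quantiles` and its argument names are hypothetical.
csv_quantiles <- function(path, column, probs = c(0.25, 0.5, 0.75),
                          relative_error = 0.01) {
  # Start (or reuse) a Spark session.
  SparkR::sparkR.session()

  # Read the CSV with a header row and let Spark infer column types,
  # so that a numeric column is not loaded as a string.
  sdf <- SparkR::read.df(path, source = "csv",
                         header = "true", inferSchema = "true")

  # relative_error = 0 requests exact quantiles; a small positive value
  # trades some accuracy for less work on large data.
  q <- SparkR::approxQuantile(sdf, column, probs, relative_error)

  # Flatten the returned list and label each value with its probability.
  setNames(unlist(q), paste0(probs * 100, "%"))
}
```

A call might look like csv_quantiles("s3a://my-bucket/data.csv", "price"), where both the bucket path and the column name are placeholders. Reading from S3 also depends on the cluster's Hadoop and credentials setup, which is not shown here.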

devlace