I want to discretize a variable using R, preferably SparkR, and get results like the following.

library(arules)
library(dplyr)  # provides %>% and mutate()

mtcars %>% mutate(bins = discretize(x = mpg, method = "interval", breaks = 4))

I checked the documentation but could only find non-R solutions at https://spark.apache.org/docs/2.2.0/ml-features.html#bucketizer.

Please advise.

Geet

1 Answer

In general, SparkR provides a very limited subset of ML functions (full support is planned for Spark 3.0 as a separate R package, see SPARK-24359 SPIP: ML Pipelines in R). Simple discretization like this, however, can be performed using CASE ... WHEN ... statements.
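To make the CASE ... WHEN ... idea concrete, here is a minimal base R sketch that only builds such an expression as a SQL string for four equal-width bins. The helper name `case_when_sql` is hypothetical, and the choice of half-open bins (with the maximum included in the last bin) is an assumption made for this illustration.

```r
# Hypothetical helper: build a SQL CASE WHEN string for equal-width bins.
case_when_sql <- function(col, breaks) {
  k <- length(breaks) - 1  # number of bins
  clauses <- vapply(seq_len(k), function(i) {
    # Bins are [lower, upper); only the last bin also includes its upper edge.
    upper_op <- if (i == k) "<=" else "<"
    sprintf("WHEN %s >= %s AND %s %s %s THEN %d",
            col, format(breaks[i], digits = 15),
            col, upper_op, format(breaks[i + 1], digits = 15), i)
  }, character(1))
  paste("CASE", paste(clauses, collapse = " "), "END")
}

rng <- range(mtcars$mpg)
breaks <- seq(rng[1], rng[2], length.out = 5)  # 5 edges -> 4 bins
sql <- case_when_sql("mpg", breaks)
cat(sql, "\n")
```

Assuming a SparkDataFrame `df` with an `mpg` column, a string like this could then be passed to SparkR's `expr()`, for example `withColumn(df, "bucket", expr(sql))`. That last step is not run here.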

First compute the breaks:

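library(SparkR)    # assumes an active session, e.g. started with sparkR.session()
library(magrittr)  # provides %>%

df <- createDataFrame(mtcars)
# collect the column's min and max to the driver as a plain numeric vector
min_max <- df %>% 
  select(min(df$mpg), max(df$mpg)) %>% 
  collect() %>% 
  unlist() 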

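n <- 4  # desired number of bins
# n bins need n + 1 break points
breaks <- seq(min_max[[1]], min_max[[2]], length.out = n + 1)

Then generate the expression:

bucket <- purrr::map2(
    breaks[-length(breaks)], breaks[-1],  # lower and upper bound of each bin
    function(x, y) between(column("mpg"), c(x, y))) %>% 
  purrr::reduce2(
    ., seq(length(.)),
    # nest when/otherwise so that later bins take precedence on shared edges
    function(acc, x, y) otherwise(when(x, y), acc), 
    .init = lit(NA))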

df %>% withColumn("bucket", bucket)
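
To sanity-check the bucket numbers on a small dataset, the same kind of equal-width binning can be reproduced locally in base R. This is only a sketch: it assumes four bins, as asked for in the question, and uses `cut()` rather than anything Spark-specific. The names `n_bins`, `edges` and `local_bucket` are introduced here for illustration; adjust `n_bins` to match the number of intervals your breaks produce.

```r
n_bins <- 4
rng <- range(mtcars$mpg)
edges <- seq(rng[1], rng[2], length.out = n_bins + 1)  # n_bins + 1 edges

# right = FALSE makes bins closed on the left, so a value on a shared edge
# falls into the upper bin; with right = FALSE, include.lowest = TRUE keeps
# the maximum value inside the last bin instead of turning it into NA.
local_bucket <- cut(mtcars$mpg, breaks = edges, labels = FALSE,
                    include.lowest = TRUE, right = FALSE)

# number of cars per bucket
print(table(local_bucket))
```

The counts per bucket can then be compared with a `count(groupBy(...))` on the Spark side, assuming both use the same break points.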
zero323