I want to perform mean, median, and mode imputation, as well as impute a user-defined value, on a Spark DataFrame.
Is there a good way to do this in Java?
For example, suppose I have these five columns, and imputation can be performed on any of them:
id, name, age, marks, percentage

ngi

1 Answer


You can use the Imputer class from the Spark ML package.

This is how you can do it in Scala:

import org.apache.spark.ml.feature.Imputer

// `spark` is the active SparkSession (available by default in spark-shell / notebooks)
val df = spark.createDataFrame(Seq[(Double, Double)](
    (8.0, 0.0),
    (5.0, 0.0),
    (0.0, 15.0),
    (4.0, 0.0),
    (5.0, 5.0)
  )).toDF("a", "b")

val imputer = new Imputer()
  .setStrategy("median")           // "mean", "median", or "mode"
  .setMissingValue(0)              // treat 0 as the missing-value marker
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("a_out", "b_out"))

val model = imputer.fit(df)
val data = model.transform(df)
data.show()                        // use display(data) in a Databricks notebook

(output screenshot: the transformed DataFrame with the imputed columns a_out and b_out)
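Since the question asks for Java, the same Imputer API is available there as well. A minimal, self-contained sketch assuming a local SparkSession (the class name and master setting are illustrative):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Imputer;
import org.apache.spark.ml.feature.ImputerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ImputerExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ImputerExample")
        .master("local[*]")
        .getOrCreate();

    // Same toy data as the Scala example; 0.0 marks a missing value
    List<Row> rows = Arrays.asList(
        RowFactory.create(8.0, 0.0),
        RowFactory.create(5.0, 0.0),
        RowFactory.create(0.0, 15.0),
        RowFactory.create(4.0, 0.0),
        RowFactory.create(5.0, 5.0));

    StructType schema = new StructType(new StructField[]{
        new StructField("a", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("b", DataTypes.DoubleType, false, Metadata.empty())});

    Dataset<Row> df = spark.createDataFrame(rows, schema);

    Imputer imputer = new Imputer()
        .setStrategy("median")                        // "mean", "median", or "mode"
        .setMissingValue(0.0)                         // treat 0.0 as the missing-value marker
        .setInputCols(new String[]{"a", "b"})
        .setOutputCols(new String[]{"a_out", "b_out"});

    // fit() computes the per-column statistic; transform() writes the imputed columns
    ImputerModel model = imputer.fit(df);
    model.transform(df).show();

    spark.stop();
  }
}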

The strategy determines how the imputation is performed (from the docs):

Imputation strategy. Available options are ["mean", "median", "mode"].

Links:

Imputer - Java Docs

Python Example

Netanel Malka
  • Now I am able to do this, but I am facing one issue when trying to impute a user-defined value: the Imputer keeps applying a strategy (the default is "mean") instead of filling my value. Is there any way to handle this? – ngi May 09 '22 at 12:59
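Regarding the follow-up in the comment: Imputer only supports the "mean", "median", and "mode" strategies, so it cannot fill a user-defined constant. A minimal sketch of one alternative using a plain column expression (the column name "a", the missing-value marker 0.0, and the fill value 99.0 are illustrative, not from the original post):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ConstantImputeSketch {
  // Replaces the missing-value marker (here 0.0) in column "a" with a
  // user-defined constant (here 99.0); both values are placeholders.
  public static Dataset<Row> imputeConstant(Dataset<Row> df) {
    return df.withColumn("a_out",
        when(col("a").equalTo(0.0), lit(99.0)).otherwise(col("a")));
  }
}

If the missing entries are actual nulls or NaNs rather than a sentinel like 0.0, df.na().fill(99.0, new String[]{"a"}) achieves the same in one call.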