I want to perform mean, median, and mode imputation, as well as impute a user-defined value, on a Spark DataFrame.
Is there a good way to do this in Java?
For example, suppose I have these five columns, and imputation can be performed on any of them:
id, name, age, marks, percentage

ngi

1 Answer


You can use the Imputer class from the Spark ML package.

This is how you can do it in Scala:

import org.apache.spark.ml.feature.Imputer

// `spark` is the active SparkSession (available by default in spark-shell / notebooks)
val df = spark.createDataFrame(Seq[(Double, Double)](
    (8.0, 0.0),
    (5.0, 0.0),
    (0.0, 15.0),
    (4.0, 0.0),
    (5.0, 5.0)
  )).toDF("a", "b")

val imputer = new Imputer()
  .setStrategy("median")           // "mean", "median", or "mode"
  .setMissingValue(0)              // treat 0 as the missing-value marker
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("a_out", "b_out"))

val model = imputer.fit(df)
val data = model.transform(df)
data.show()                        // use display(data) in a Databricks notebook

(output screenshot: the transformed DataFrame with the imputed columns a_out and b_out)
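Since the question asks for Java, the same Imputer API is available there as well. A minimal, self-contained sketch assuming a local SparkSession (the class name and master setting are illustrative):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Imputer;
import org.apache.spark.ml.feature.ImputerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ImputerExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ImputerExample")
        .master("local[*]")
        .getOrCreate();

    // Same toy data as the Scala example; 0.0 marks a missing value
    List<Row> rows = Arrays.asList(
        RowFactory.create(8.0, 0.0),
        RowFactory.create(5.0, 0.0),
        RowFactory.create(0.0, 15.0),
        RowFactory.create(4.0, 0.0),
        RowFactory.create(5.0, 5.0));

    StructType schema = new StructType(new StructField[]{
        new StructField("a", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("b", DataTypes.DoubleType, false, Metadata.empty())});

    Dataset<Row> df = spark.createDataFrame(rows, schema);

    Imputer imputer = new Imputer()
        .setStrategy("median")                        // "mean", "median", or "mode"
        .setMissingValue(0.0)                         // treat 0.0 as the missing-value marker
        .setInputCols(new String[]{"a", "b"})
        .setOutputCols(new String[]{"a_out", "b_out"});

    // fit() computes the per-column statistic; transform() writes the imputed columns
    ImputerModel model = imputer.fit(df);
    model.transform(df).show();

    spark.stop();
  }
}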

The strategy determines how the imputation is performed (from the docs):

Imputation strategy. Available options are ["mean", "median", "mode"].

Links:

Imputer - Java Docs

Python Example

Netanel Malka
  • Now I am able to do this, but I am facing one issue when trying to impute a user-defined value: the Imputer keeps applying a strategy (the default is "mean") instead of filling my value. Is there any way to handle this? – ngi May 09 '22 at 12:59
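Regarding the follow-up in the comment: Imputer only supports the "mean", "median", and "mode" strategies, so it cannot fill a user-defined constant. A minimal sketch of one alternative using a plain column expression (the column name "a", the missing-value marker 0.0, and the fill value 99.0 are illustrative, not from the original post):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ConstantImputeSketch {
  // Replaces the missing-value marker (here 0.0) in column "a" with a
  // user-defined constant (here 99.0); both values are placeholders.
  public static Dataset<Row> imputeConstant(Dataset<Row> df) {
    return df.withColumn("a_out",
        when(col("a").equalTo(0.0), lit(99.0)).otherwise(col("a")));
  }
}

If the missing entries are actual nulls or NaNs rather than a sentinel like 0.0, df.na().fill(99.0, new String[]{"a"}) achieves the same in one call.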