I want to perform mean, median, and mode imputation, and also impute a user-defined value, on a Spark DataFrame.
Is there a good way to do this in Java?
For example, suppose I have these five columns, and imputation can be performed on any of them:
id, name, age, marks, percentage
1 Answer
You can use the Imputer class from the Spark ML package.
This is how you can do it in Scala:
import org.apache.spark.ml.feature.Imputer

// 0.0 is used as the placeholder for missing values in this example
val df = spark.createDataFrame(Seq[(Double, Double)](
  (8.0, 0.0),
  (5.0, 0.0),
  (0.0, 15.0),
  (4.0, 0.0),
  (5.0, 5.0)
)).toDF("a", "b")

val imputer = new Imputer()
  .setStrategy("median")      // "mean", "median" or "mode"
  .setMissingValue(0)         // which value is treated as missing
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("a_out", "b_out"))

val model = imputer.fit(df)
val data = model.transform(df)
data.show()
The strategy determines how the missing values are imputed (from the docs):
Imputation strategy. Available options are ["mean", "median", "mode"].
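Since the question asks about Java, the same Imputer API can be used there as well. A minimal sketch, assuming the same toy data and the 0.0 missing-value sentinel as in the Scala example above (the column names a/b and the app name are just placeholders):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.Imputer;
import org.apache.spark.ml.feature.ImputerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ImputerExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("imputer-example").getOrCreate();

        // Same toy data as the Scala example; 0.0 marks a missing value
        List<Row> rows = Arrays.asList(
                RowFactory.create(8.0, 0.0),
                RowFactory.create(5.0, 0.0),
                RowFactory.create(0.0, 15.0),
                RowFactory.create(4.0, 0.0),
                RowFactory.create(5.0, 5.0));
        StructType schema = new StructType()
                .add("a", DataTypes.DoubleType)
                .add("b", DataTypes.DoubleType);
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        Imputer imputer = new Imputer()
                .setStrategy("median")                      // "mean", "median" or "mode"
                .setMissingValue(0.0)                       // which value is treated as missing
                .setInputCols(new String[]{"a", "b"})
                .setOutputCols(new String[]{"a_out", "b_out"});

        ImputerModel model = imputer.fit(df);
        Dataset<Row> result = model.transform(df);
        result.show();

        spark.stop();
    }
}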
– Netanel Malka
Now I am able to do this, but I am facing one issue when trying to use a user-defined value as the missingValue: it keeps applying the default strategy "mean". Is there any way to handle this? – ngi May 09 '22 at 12:59
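Regarding the follow-up comment: Imputer's setMissingValue only controls which value is treated as missing; the replacement is always computed by the chosen strategy (mean, median, or mode), so there is no strategy for imputing a fixed, user-defined value. For that part of the question, a sketch in Java using plain DataFrame operations (the column names and the -1.0 constant are only placeholders):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Replace nulls in "age" and "marks" with a user-chosen constant, e.g. -1.0
Dataset<Row> filled = df.na().fill(-1.0, new String[]{"age", "marks"});

// If missing values are encoded as a sentinel (e.g. 0.0) rather than null,
// a when/otherwise expression swaps in the user-defined value:
Dataset<Row> filled2 = df.withColumn("marks",
        when(col("marks").equalTo(0.0), lit(-1.0)).otherwise(col("marks")));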