
I want to cluster my data using Spark's MLlib functions. The problem is that in my dataset I sometimes get NULL as a feature value.

I can't substitute 0.0 for it, since that would simply be wrong. So I tried using Double.NaN instead. This doesn't work either, and the clustering fails with:

java.lang.IllegalArgumentException: requirement failed

What is the common way to handle this issue?

  • `NaN` values cannot be used by most algorithms. Either drop the missing data completely (column-wise or row-wise) or impute the missing values (the mean is a relatively cheap and scalable way to do it). – zero323 Apr 20 '16 at 20:08
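  • For reference, a minimal sketch of both approaches in Scala, assuming a DataFrame with hypothetical numeric feature columns `f1` and `f2`. Row-wise dropping works in any Spark version via `df.na.drop()`; the mean-imputation step uses `spark.ml`'s `Imputer`, which only exists in Spark 2.2+ (i.e. newer than this question):

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("missing-features").getOrCreate()
import spark.implicits._

// Hypothetical toy data: "f2" is missing in the second row.
val df = Seq(
  (1.0, Some(2.0)),
  (3.0, None),
  (5.0, Some(6.0))
).toDF("f1", "f2")

// Option 1: drop every row that contains a null or NaN feature.
val dropped = df.na.drop()

// Option 2: replace missing values with the column mean (Spark 2.2+).
// Imputer treats nulls and NaN as missing by default.
val imputer = new Imputer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1_imputed", "f2_imputed"))
  .setStrategy("mean")

val imputed = imputer.fit(df).transform(df)
```

    After either step, the cleaned columns can be assembled into a feature vector and passed to the MLlib clusterer as usual.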
