
I want to cluster my data using Spark's MLlib functions. The problem is that in my dataset I sometimes get NULL as a feature value.

I can't substitute 0.0 for it, since that would simply be wrong. So I tried using Double.NaN instead. This doesn't work either, and the clustering fails with:

java.lang.IllegalArgumentException: requirement failed

What is the common way to handle this issue?

  • `NaN` values cannot be used by most algorithms. Either drop the missing data completely (column-wise or row-wise) or impute the missing values (the mean is a relatively cheap and scalable way to do it). – zero323 Apr 20 '16 at 20:08
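  • For reference, a minimal sketch of both approaches in Scala, assuming a DataFrame with hypothetical numeric feature columns `f1` and `f2`. Row-wise dropping works in any Spark version via `df.na.drop()`; the mean-imputation step uses `spark.ml`'s `Imputer`, which only exists in Spark 2.2+ (i.e. newer than this question):

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("missing-features").getOrCreate()
import spark.implicits._

// Hypothetical toy data: "f2" is missing in the second row.
val df = Seq(
  (1.0, Some(2.0)),
  (3.0, None),
  (5.0, Some(6.0))
).toDF("f1", "f2")

// Option 1: drop every row that contains a null or NaN feature.
val dropped = df.na.drop()

// Option 2: replace missing values with the column mean (Spark 2.2+).
// Imputer treats nulls and NaN as missing by default.
val imputer = new Imputer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1_imputed", "f2_imputed"))
  .setStrategy("mean")

val imputed = imputer.fit(df).transform(df)
```

    After either step, the cleaned columns can be assembled into a feature vector and passed to the MLlib clusterer as usual.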
