Remove outliers in multiple columns from a Spark DataFrame

Asked Jul 03 '17 at 21:09

Active Jul 03 '17 at 21:09

Viewed 1,030 times

I have a dataset with around 10 integer features, and I want to remove outliers from each feature. In the past, I have computed the mean and standard deviation of each feature and then made a pass over the dataset, discarding the rows that qualify as outliers. Doing this for each column in turn removes every row that has at least one outlier feature.

Since scanning the dataset once per column is inefficient, I am looking for a more computationally efficient approach. Can someone suggest a way to scan the dataset once and remove all rows that contain an outlier?
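For illustration, here is a minimal sketch of one way the per-column passes could be collapsed. It assumes "outlier" means a value more than k standard deviations from the column mean (k = 3 is an arbitrary choice here) and that column names contain no backticks. The helper names `stats_exprs`, `inlier_condition`, and `remove_outliers` are hypothetical. Because the thresholds depend on the whole dataset, a strict single scan is not possible with this definition, but the work can be reduced to two scans in total regardless of the number of columns: one aggregation for all statistics, then one filter with a combined condition.

```python
def stats_exprs(cols):
    """Build SQL aggregates: mean and sample stddev for every column."""
    exprs = []
    for c in cols:
        exprs.append(f"avg(`{c}`) AS `{c}_mean`")
        exprs.append(f"stddev_samp(`{c}`) AS `{c}_std`")
    return exprs


def inlier_condition(stats, cols, k=3.0):
    """Build a SQL predicate that holds only if every column is within
    k standard deviations of its mean (assumed outlier definition)."""
    parts = [
        f"abs(`{c}` - ({stats[c + '_mean']})) <= {k * stats[c + '_std']}"
        for c in cols
    ]
    return " AND ".join(parts)


def remove_outliers(df, cols, k=3.0):
    """df is assumed to be a pyspark.sql.DataFrame with numeric columns."""
    # Scan 1: a single aggregation job computes all statistics at once.
    stats = df.selectExpr(*stats_exprs(cols)).first().asDict()
    # Scan 2: a single filter drops rows with an outlier in any column.
    # Rows with a null in any listed column are dropped as well, because
    # the comparison evaluates to null.
    return df.filter(inlier_condition(stats, cols, k))
```

Usage would look like `clean = remove_outliers(df, ["f1", "f2"])`, where the column names are placeholders. Calling `df.cache()` beforehand may avoid recomputing the input for the second scan. The sketch does not handle a column whose standard deviation comes back null, for example when the dataset has a single row.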

asked Jul 03 '17 at 21:09

disha


0 Answers