
I stumbled upon the following problem:

I'm working on a beginner's project in data science. I have my train and test data splits, and right now I'm analysing every feature and then adding it either to a dataframe for discretised continuous variables or to a dataframe for continuous variables. While doing so, I encountered a feature with big outliers. If I were to delete those rows, the other features I have already added to my sub-dataframes would end up with more entries than this one (see the sketch below).
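
Here is a minimal toy example of what I mean (the column names and values are made up):

```python
import pandas as pd

# Toy train data; the column names and values are invented for illustration:
train = pd.DataFrame({
    "age":    [22, 35, 29, 41, 38],
    "income": [38_000, 45_000, 1_000_000, 52_000, 48_000],  # has one big outlier
})

# A feature that was already added to one of the sub-dataframes:
continuous_df = train[["age"]].copy()

# Dropping the outlier row only for "income" makes it shorter than the
# columns collected so far, so the sub-dataframes no longer line up:
income_no_outlier = train.loc[train["income"] < 200_000, "income"]
print(len(continuous_df), len(income_no_outlier))  # 5 vs. 4
```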

Should I just find a strategy to overwrite the outliers with "better" values, or should I reconsider my approach of splitting the train data into the two variable types in the first place? I don't think simply dropping the outlier rows from the actual train data would be useful, though...

Chris

1 Answer


There are many ways to deal with outliers. In my data science course we used "data imputation":

But before you start to replace or remove data, it's important to analyse what difference the outlier makes and whether the outlier is valid, of course.

  • If the outlier is invalid, you can delete it and use data imputation as explained below.

  • If the outlier is valid, check the difference in the outcome with and without it. If the difference is very small, there isn't a problem. If the difference is significant, you can use standardization and normalization (see the sketch after this list).
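
As a quick sketch (assuming the feature lives in a pandas Series; scikit-learn is used here for the scaling, but any implementation works):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

income = pd.Series([38_000, 45_000, 1_000_000, 52_000, 48_000])  # one value is a big outlier

# How much does the outlier shift a summary statistic?
print(income.mean())                    # with the outlier
print(income[income < 200_000].mean())  # without the outlier

# Standardization (zero mean, unit variance) and normalization (scale to [0, 1]):
standardized = StandardScaler().fit_transform(income.to_frame())
normalized = MinMaxScaler().fit_transform(income.to_frame())
```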

You can replace the outlier with (a small sketch follows the list):

  • a random value (not recommended)
  • a value based on heuristic logic
  • a value based on its neighbours
  • the median, mean, or mode
  • a value based on interpolation or a prediction from an ML model
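
A minimal sketch of the median- and neighbour-based replacements in pandas (the outlier rule used here is a simple 1.5 × IQR cutoff, which is just one possible criterion; pick whatever rule fits your data):

```python
import pandas as pd

income = pd.Series([38_000, 45_000, 1_000_000, 52_000, 48_000])

# Flag outliers with a simple 1.5 * IQR rule:
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)

# Replace with the median of the remaining values:
by_median = income.mask(is_outlier, income[~is_outlier].median())

# Replace based on the neighbours via linear interpolation:
by_neighbours = income.mask(is_outlier).interpolate()

print(by_median.tolist())
print(by_neighbours.tolist())
```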

I recommend trying a few of these strategies and keeping the one that gives the best outcome.

StatQuest explains data science and machine learning concepts in a very easy and understandable way, so refer to him if you encounter more theoretical questions: https://www.youtube.com/user/joshstarmer