What is the best method to identify and replace outliers in the ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term columns using pandas in Python?

I tried the IQR method with a seaborn boxplot to identify the outliers and replaced them with NaN. I then took the mean of ApplicantIncome and filled the NaN records with it.

I also tried grouping by a combination of columns, e.g. Gender, Education, Self_Employed, Property_Area.

My dataframe has the following columns:

Loan_ID              LP001357
Gender                   Male
Married                   NaN
Dependents                NaN
Education            Graduate
Self_Employed              No
ApplicantIncome          3816
CoapplicantIncome         754
LoanAmount                160
Loan_Amount_Term          360
Credit_History              1
Property_Area           Urban
Loan_Status                 Y
Kalamarico

1 Answer

Outliers

Just like missing values, your data might also contain values that diverge heavily from the vast majority of your other data. These data points are called "outliers". To find them, you can check the distribution of each individual variable with a box plot, or you can make a scatter plot of your data to identify points that don't lie in the "expected" area of the plot.
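The box-plot rule can also be applied numerically. The following is a minimal sketch, assuming the common 1.5 × IQR convention that box plots use for their whiskers; the helper name `iqr_outlier_mask` and the sample values are made up for illustration and are not from the question's dataset.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Made-up sample values, for illustration only
df = pd.DataFrame({"ApplicantIncome": [3816, 4200, 2500, 5100, 3900, 81000]})

mask = iqr_outlier_mask(df["ApplicantIncome"])
print(df[mask])  # rows flagged as outliers
```

The same helper can be applied to CoapplicantIncome, LoanAmount, and Loan_Amount_Term. Note that Loan_Amount_Term is dominated by a few standard values in many loan datasets, so the IQR rule may flag legitimate terms there.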

The causes of outliers in your data can vary, from system errors to mistakes introduced during data entry or data processing, but it's important to consider the effect they can have on your analysis: they distort summary statistics such as the mean and standard deviation (the median is far less sensitive), and they can make the data less normally distributed and affect the results of statistical models such as regression or ANOVA.
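As a small illustration of how much a single extreme value can shift summary statistics, here is a sketch with made-up income values (not taken from the question's data):

```python
import pandas as pd

# Made-up values: the second series is the first plus one extreme income
clean = pd.Series([2500, 3816, 3900, 4200, 5100])
with_outlier = pd.Series([2500, 3816, 3900, 4200, 5100, 81000])

summary = pd.DataFrame({
    "clean": clean.agg(["mean", "median", "std"]),
    "with_outlier": with_outlier.agg(["mean", "median", "std"]),
})
print(summary)
```

In this example the mean and standard deviation change drastically, while the median moves only slightly.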

To deal with outliers, you can either delete, transform, or impute them: the decision will again depend on the data context. That’s why it’s again important to understand your data and identify the cause for the outliers:

  • If the outlier value is due to data entry or data processing errors, you might consider deleting the value.
  • You can transform the outliers by assigning weights to your observations or by taking the natural log to reduce the variation that the outlier values cause in your data set.
  • Just like the missing values, you can also use imputation methods to replace the extreme values of your data with median, mean or mode values.
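The three options above could be sketched in pandas roughly as follows. This is only an illustration under stated assumptions: the sample rows are made up, outliers are flagged with the 1.5 × IQR rule, the log transform uses `np.log1p` so that zero incomes are handled, and the imputation uses the median within each Education group. The new column names (`ApplicantIncome_log`, `ApplicantIncome_imputed`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample rows, for illustration only
df = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "ApplicantIncome": [3816, 4200, 81000, 2500, 2700, 2600],
})

col = "ApplicantIncome"
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)

# Option 1: delete the rows that contain outliers
dropped = df[~is_outlier]

# Option 2: transform with log(1 + x) to compress large values
df["ApplicantIncome_log"] = np.log1p(df[col])

# Option 3: impute. Set outliers to NaN, then fill with the group median
masked = df[col].mask(is_outlier)
group_median = masked.groupby(df["Education"]).transform("median")
df["ApplicantIncome_imputed"] = masked.fillna(group_median)

print(df)
```

To group by several columns, as in the question (Gender, Education, Self_Employed, Property_Area), a list of those Series can be passed to `groupby` instead of a single one. Small groups may then have no non-outlier values left, in which case a fallback such as the overall median would be needed.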

The same functions used to handle missing values can also be used to deal with outliers in your data.

The following links may be useful:

Python data cleaning

Ways to detect and remove the outliers

Jeyam Prakash