
I am building an automated cleaning process that cleans null values from a dataset. I discovered a few functions, like mode, median, and mean, which could be used to fill NaN values in the given data. But which one should I select? If the data is categorical it has to be either mode or median, while for continuous data it has to be mean or median. So, to determine whether the data is categorical or continuous, I decided to build a machine learning classification model.

I took a few features, like:
1) standard deviation of the data
2) number of unique values in the data
3) total number of rows of data
4) ratio of unique values to total rows
5) minimum value of the data
6) maximum value of the data
7) number of data points between the median and the 75th percentile
8) number of data points between the median and the 25th percentile
9) number of data points between the 75th percentile and the upper whisker
10) number of data points between the 25th percentile and the lower whisker
11) number of data points above the upper whisker
12) number of data points below the lower whisker

First, with these 12 features and around 55 training samples, I used a logistic regression model on the normalized features to predict label 1 (continuous) or 0 (categorical).
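For reference, a minimal sketch of how such features could be computed and fed to logistic regression with pandas and scikit-learn (the `extract_features` helper, the 1.5 × IQR whisker definition, and the toy columns are illustrative assumptions, not my actual code):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def extract_features(col: pd.Series) -> list:
    """Compute the 12 descriptors listed above for one column (illustrative helper)."""
    x = col.dropna().astype(float)
    q1, q2, q3 = x.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    upper_whisker = q3 + 1.5 * iqr   # assumed whisker definition
    lower_whisker = q1 - 1.5 * iqr
    n = len(x)
    return [
        x.std(),                                   # 1) standard deviation
        x.nunique(),                               # 2) number of unique values
        n,                                         # 3) total number of rows
        x.nunique() / n,                           # 4) ratio of unique values to rows
        x.min(),                                   # 5) minimum value
        x.max(),                                   # 6) maximum value
        ((x > q2) & (x <= q3)).sum(),              # 7) between median and 75th percentile
        ((x >= q1) & (x < q2)).sum(),              # 8) between 25th percentile and median
        ((x > q3) & (x <= upper_whisker)).sum(),   # 9) between 75th percentile and upper whisker
        ((x >= lower_whisker) & (x < q1)).sum(),   # 10) between lower whisker and 25th percentile
        (x > upper_whisker).sum(),                 # 11) above upper whisker
        (x < lower_whisker).sum(),                 # 12) below lower whisker
    ]

# Toy training set: one continuous column (label 1) and one categorical-coded column (label 0).
columns = [pd.Series(np.random.randn(100)), pd.Series(np.random.randint(0, 3, 100))]
labels = [1, 0]

X = np.array([extract_features(c) for c in columns])
y = np.array(labels)

model = LogisticRegression()
model.fit(StandardScaler().fit_transform(X), y)
```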

The fun part is that it worked!

But did I do it the right way? Is this a correct method for predicting the nature of the data? Please advise me on how I could improve it further.

Yash

2 Answers


The data analysis seems awesome. For the part

But which one should I select?

The mean has always been the winner as far as I have tested. For every dataset, I try out all the cases and compare accuracy.

There is a better approach, but it is a bit time-consuming. If you want to take this system forward, it can help.

For each column with missing data, find the nearest neighbor and replace the missing value with that neighbor's value. Suppose you have N columns excluding the target. For each column with missing values, treat it as the dependent variable and the remaining N-1 columns as independent variables. For each row with a missing value, find its nearest neighbor among the complete rows; that neighbor's value of the dependent variable is the desired replacement for the missing attribute.
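A minimal sketch of this neighbor-based idea using scikit-learn's KNNImputer (the toy matrix is made up; with n_neighbors=1 it comes closest to the single-nearest-neighbor replacement described above):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is filled from the nearest row(s), measured on the
# columns that are present in both rows.
imputer = KNNImputer(n_neighbors=1)
X_filled = imputer.fit_transform(X)
print(X_filled)
```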

Akash Kumar

But which one should I select? If the data is categorical it has to be either mode or median, while for continuous data it has to be mean or median.

Usually, the mode is used for categorical data and the mean for continuous data. But I recently saw an article where the geometric mean was used for categorical values. If you build a model that uses columns with NaN values, you can include columns with mean replacement, columns with median replacement, and also a boolean column 'value is NaN'. But it is better not to use linear models in this case, because you can face correlation between those columns.
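A minimal sketch of such replacement and indicator columns with pandas (the column names and data are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, np.nan, 31]})

# Keep several views of the same column: mean-filled, median-filled,
# and a boolean flag marking where the original value was NaN.
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())
df["age_is_nan"] = df["age"].isna()
print(df)
```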

Besides these, there are many other methods for replacing NaN values, for example the MICE algorithm.
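A minimal sketch of a MICE-style imputation using scikit-learn's IterativeImputer, a round-robin multivariate imputer (the toy data is made up):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each column with missing values is modelled as a function of the other
# columns, and the columns are imputed in turn over several iterations.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```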

Regarding the features you use: they are OK, but I'd advise adding some more distribution-related features, for example:

  • skewness
  • kurtosis
  • similarity to the Gaussian distribution (and other distributions)
  • the number of 1D Gaussians needed to fit your column (via a GMM; this won't perform well with only 55 rows)

You can compute all of these on the original data as well as on transformed data (log, exp).
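A minimal sketch of computing such distribution features with scipy on raw and log-transformed data (the example column is made up):

```python
import numpy as np
from scipy import stats

x = np.random.lognormal(size=200)  # an example skewed, positive column

# Distribution-shape features on the raw and log-transformed data.
for name, v in [("raw", x), ("log", np.log(x))]:
    skew = stats.skew(v)
    kurt = stats.kurtosis(v)
    _, p_normal = stats.normaltest(v)  # p-value of D'Agostino's normality test
    print(name, round(skew, 2), round(kurt, 2), round(p_normal, 3))
```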

To explain: you can have a column with many categories inside. With the old approach it may simply look like a numerical column, but it is not numerical. A distribution-matching algorithm may help here.

You can also use different normalization. RobustScaler from sklearn may work well (it may help in cases where categories have levels very similar to outlier values).
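A minimal sketch showing how RobustScaler keeps one extreme value from dominating the scaling (the data is made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [2.0], [3.0], [100.0]])  # one extreme value

# RobustScaler centres on the median and scales by the IQR,
# so the outlier has little effect on the other values.
print(RobustScaler().fit_transform(X).ravel())
```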

And the last piece of advice: you can use a random forest model for this task and look at the important columns. That list may give some direction for feature engineering/generation.
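A minimal sketch of ranking features with a random forest's impurity-based importances (random stand-in data here; the real X and y would be your column descriptors and continuous/categorical labels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(55, 12))          # stand-in for the 12 column descriptors
y = rng.integers(0, 2, size=55)        # stand-in for the labels
feature_names = [f"f{i}" for i in range(12)]  # placeholder names

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance.
for name, imp in sorted(zip(feature_names, forest.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(name, round(imp, 3))
```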

And, of course, taking a look at the confusion matrix and at which samples the errors happen is also a good idea!

avchauzov