In medical data it is normal to have a lot of missing values. I am working with a dataset that has tens of numerical features, and many of them have a large share of missing values.

The dataset has only 188453 rows (time stamps), each labeled 0 or 1. That is not very big, so I would rather not delete data. Most of the labels are 0 (90% of the dataset). Some features are present in fewer than 10% of the rows. The missing ratio per feature is almost the same in both labels (the correlation coefficient between the two is almost 1).
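To make the missing-ratio comparison concrete, here is a minimal sketch of how it could be computed with pandas. The toy data, the column names `A`, `B`, `C`, `label`, and the helper `missing_ratio_by_label` are all hypothetical stand-ins, not the real dataset.

```python
import numpy as np
import pandas as pd

def missing_ratio_by_label(df, label_col):
    """Fraction of missing values per feature, one column per label value."""
    features = df.columns.drop(label_col)
    # isna() gives a boolean frame; the group-wise mean is the missing ratio.
    return df[features].isna().groupby(df[label_col]).mean().T

# Hypothetical toy data: about 10% of rows have label 1.
rng = np.random.default_rng(0)
n = 5000
toy = pd.DataFrame({
    "A": rng.normal(size=n),
    "B": rng.normal(size=n),
    "C": rng.normal(size=n),
    "label": (rng.random(n) < 0.1).astype(int),
})
toy.loc[rng.random(n) < 0.9, "A"] = np.nan  # A is mostly missing
toy.loc[rng.random(n) < 0.5, "B"] = np.nan  # B is half missing

ratios = missing_ratio_by_label(toy, "label")
print(ratios)
# Correlation, across features, of the missing ratios in label 0 vs label 1.
print(ratios[0].corr(ratios[1]))
```

A correlation near 1 here only says that the same features tend to be missing in both labels; it says nothing about why they are missing.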

I know there are several ways to deal with missing values, such as deleting rows, mean imputation, and so on. I may try MICE, though I don't know whether it would work, because I noticed that the correlation coefficients between some features differ between label 0 and label 1. For example, the correlation coefficient between feature A and feature B is low in label 0 but high in label 1.
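A minimal sketch of the per-label correlation check, assuming pandas and hypothetical columns `A`, `B`, and `label`. The helper `corr_by_label` is made up for illustration. It also reports how many rows have both features present, since a correlation estimated from very few pairs is unreliable.

```python
import numpy as np
import pandas as pd

def corr_by_label(df, label_col, a, b):
    """Pearson correlation of a and b within each label, on rows where both exist."""
    out = {}
    for label, group in df.groupby(label_col):
        pair = group[[a, b]].dropna()
        out[label] = {"n_pairs": len(pair), "corr": pair[a].corr(pair[b])}
    return out

# Hypothetical toy data: A and B are related only when label == 1.
rng = np.random.default_rng(1)
n = 5000
label = (rng.random(n) < 0.1).astype(int)
A = rng.normal(size=n)
B = np.where(label == 1, A + 0.3 * rng.normal(size=n), rng.normal(size=n))
toy = pd.DataFrame({"A": A, "B": B, "label": label})
toy.loc[rng.random(n) < 0.5, "B"] = np.nan  # half of B is missing

result = corr_by_label(toy, "label", "A", "B")
print(result)
```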

So, my questions are:

  1. For features that are present in fewer than 10% of the rows (some even fewer than 1%), should I just drop them, or is it okay to try MICE on them?
  2. I think it would be better to run MICE separately for label 0 and label 1, because the correlation coefficients between some features differ. But if I do that, I don't know how to impute missing values in the test data, since I won't know the labels there.
  3. The two labels are very imbalanced. I have no idea how to do data augmentation with so many missing values.
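For reference, this is a minimal sketch of the label-free variant of question 2: a single MICE-style imputer fitted on the training features only, so the same fitted imputer can be applied to test data. It assumes scikit-learn's experimental `IterativeImputer`. The toy data, the missing rate, and the classifier are hypothetical choices; `add_indicator=True` and `class_weight="balanced"` are just one possible way to keep the missingness pattern and to handle the imbalance without augmentation. I don't know whether this is adequate for my data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 4 features, a minority positive class, 30% of values missing.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 1.6).astype(int)
X[rng.random(X.shape) < 0.3] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(
    # The imputer only sees the feature matrix, never the label,
    # so it can be reused unchanged on unlabeled test rows.
    IterativeImputer(max_iter=10, random_state=0, add_indicator=True),
    # Reweight classes instead of augmenting the minority class.
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Imputation of the test rows happens inside the pipeline, with the imputer fitted on train.
proba = model.predict_proba(X_test)[:, 1]
print(proba[:5])
```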

If there are better ways to deal with a situation like this, I would be glad to hear them. Thank you for reading my question; I look forward to your answers!

CW.Chou