
I am using a dataset to build a machine learning model. The samples have three label categories: "abnormal", "normal", and "data lost".

It is the "data lost" category that confuses me. In the samples, this category means that some features in the row are null.

My question is: since nulls in the dataset should lead to a "data lost" prediction, do I still need to call fillna during data preprocessing?

If I fill the nulls with some value (mean, median, whatever), won't the samples that should be predicted as "data lost" get confused with the others?

Or is there a value I should use for fillna that can indicate it's

Kamook
  • Better to ask it in ["Data Science"](https://datascience.stackexchange.com/) or ["Cross Validated"](https://stats.stackexchange.com/). – LoMaPh Apr 10 '19 at 03:59

1 Answer


The text below applies if you plan to use LightGBM, XGBoost or CatBoost.

The most important thing is to check whether you can be 100% sure that every "data lost" label comes with at least one null in the row, and that every null in any column comes with the "data lost" label. If so, you can exclude all those rows from the train and test datasets, label them "data lost" by rule, and train on the rest using only two labels. Boring.
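A quick way to run that check with pandas might look like the sketch below. The toy DataFrame, the column names `f1`, `f2`, `label`, and the variable names are all made up for illustration; swap in your own data.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are assumptions for illustration only.
df = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [0.5, 0.7, np.nan, 0.9],
    "label": ["normal", "data lost", "data lost", "abnormal"],
})
feature_cols = ["f1", "f2"]

# True for rows with at least one null feature.
has_null = df[feature_cols].isna().any(axis=1)
# True for rows labelled "data lost".
is_lost = df["label"].eq("data lost")

# The rule holds only if the two masks agree on every row.
perfect_match = bool((has_null == is_lost).all())

# Rows that break the rule in either direction (null without the label,
# or the label without a null). Worth inspecting by hand.
mismatches = df[has_null != is_lost]

# If the rule holds, null rows can be labelled by rule and the model
# trained on the remaining rows with two classes; otherwise keep all rows.
train_df = df[~has_null] if perfect_match else df
```

If `mismatches` is non-empty on your real data, the rule-based shortcut is not safe and the three-label approach is the one to use.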

The more interesting situation is when the above is not fully true. In that case you have to train with three labels, and some feature engineering and special imputation are needed. Primo, an additional feature holding the number of nulls in the row will be very helpful. Secundo, filling the nulls is very important, not with the mean/median/etc. but with a value different from all the others, e.g. -9999999. And, importantly, do not let the GBM methods treat them as nulls. Why? GBM methods find the cut value without taking nulls into account, and then check whether it is better to send the nulls to the left or to the right leaf. This strategy is good in all cases but this one, where "data lost" is a label and nulls make that label very probable.
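The two preprocessing steps could be sketched with pandas roughly as follows. The toy data, the column names, and the `null_count` feature name are assumptions for illustration; the sentinel is the example value from above and only works if it lies outside the real range of every feature.

```python
import numpy as np
import pandas as pd

# Sentinel taken from the text above; it must not collide with real values.
SENTINEL = -9999999

# Toy data; column names and values are assumptions for illustration only.
df = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, np.nan],
    "f2": [0.5, 0.7, np.nan, np.nan],
})
feature_cols = ["f1", "f2"]

# Primo: count the nulls per row. This must happen BEFORE filling,
# otherwise the count is always zero.
df["null_count"] = df[feature_cols].isna().sum(axis=1)

# Secundo: replace nulls with the sentinel so the model sees an ordinary
# (if extreme) number instead of a missing value.
df[feature_cols] = df[feature_cols].fillna(SENTINEL)
```

After this step the features contain no NaN, so the libraries' default missing-value handling has nothing to act on. With XGBoost, just make sure you do not pass the sentinel as the `missing` argument, since that would turn it back into a missing value.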