1

Current XGBoost implementations can handle missing values by choosing the best default direction for them at each split during training, i.e. the direction that minimizes the loss (source). Within our institution this feature has been of great value, as we deal with sparse tabular data.
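To make the mechanism concrete, here is a minimal sketch of the idea for a single, fixed split. It is a simplification, not XGBoost's actual code: XGBoost scores splits using gradient statistics, whereas this sketch uses plain squared error, and the helper names `sse` and `best_missing_direction` are hypothetical.

```python
import numpy as np


def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    return float(((values - values.mean()) ** 2).sum()) if values.size else 0.0


def best_missing_direction(x, y, threshold):
    """Return ('left' or 'right', loss) for routing NaNs at a fixed split."""
    missing = np.isnan(x)
    # Comparisons with NaN evaluate to False, so mask missing rows explicitly.
    goes_left = ~missing & (x < threshold)
    goes_right = ~missing & (x >= threshold)

    # Try both routings for the missing rows and keep the cheaper one.
    loss_if_left = sse(y[goes_left | missing]) + sse(y[goes_right])
    loss_if_right = sse(y[goes_left]) + sse(y[goes_right | missing])
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right


# Toy data: the rows with a missing feature have targets close to the
# targets of the high-valued rows.
x = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
y = np.array([0.0, 0.1, 1.0, 1.0, 0.9, 1.1])
direction, loss = best_missing_direction(x, y, threshold=5.0)
print(direction, round(loss, 3))
```

On this toy data the missing rows are sent to the right child, because their targets resemble those of the rows above the threshold.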

Our next project is about detecting outliers in similar datasets: huge tabular datasets with relatively large amounts of missing data. One of the interesting techniques we've come across is Isolation Forests. We would now like to explore the possibility of integrating a feature like XGBoost's missing-value handling into current Isolation Forest implementations. I therefore have two questions:

1] Would this idea of integrating missing data handling into Isolation Forests be technically feasible, and on top of that, make any sense?

2] Would other missing data handling techniques (e.g. prior imputation) or even other outlier detection algorithms work much better in these cases?

Please let me know your advice; it would be of great value! Thank you in advance.

wptmdoorn

1 Answer

-1

Generally speaking, tree-based models handle missing values well, although there is no harm in imputing the feature with its median. (An imputed value would likely be treated as a normal value and therefore wouldn't contribute much.)
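As a rough sketch of the median-imputation route, assuming scikit-learn is the library in use (the question does not say which one), the imputer and the Isolation Forest can be chained in a pipeline. The synthetic data, the 20% missing rate and the parameter values below are made up for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a sparse tabular dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:5] += 6.0                            # shift a few rows to act as outliers
X[rng.random(X.shape) < 0.2] = np.nan   # knock out roughly 20% of the cells

# Fill each missing cell with its column median, then fit the forest.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    IsolationForest(n_estimators=100, random_state=0),
)
model.fit(X)

labels = model.predict(X)            # -1 for outliers, 1 for inliers
scores = model.decision_function(X)  # lower means more anomalous
```

If the fact that a value is missing may itself be informative, `SimpleImputer(strategy="median", add_indicator=True)` appends binary missingness columns that the forest can also split on.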

cigien
DB2