
Until now I was under the impression that machine learning algorithms (GBM, random forest, xgboost, etc.) can handle bad features (variables) present in the data.

In one of my problems there are around 150 features. With xgboost I get a log loss of around 1 if I use all features, but if I remove around 10 bad features (found using some technique) I observe a log loss of 0.45. That is a huge improvement.

My question is: can bad features really make such a big difference?
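For concreteness, here is a minimal sketch of the kind of comparison described above: cross-validated log loss with all features versus with the suspect features removed. The dataset, model settings, and dropped-feature list are placeholders, not the actual data or selection technique.

```python
# Minimal sketch (not the real data): compare cross-validated log loss
# with all features vs. with a suspected-bad subset removed.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in: 150 features, only 20 of them informative.
X, y = make_classification(n_samples=5000, n_features=150,
                           n_informative=20, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(150)])

bad_features = [f"f{i}" for i in range(140, 150)]  # placeholder list

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)

full = -cross_val_score(model, X, y, cv=5, scoring="neg_log_loss").mean()
reduced = -cross_val_score(model, X.drop(columns=bad_features), y,
                           cv=5, scoring="neg_log_loss").mean()

print(f"log loss, all features:        {full:.3f}")
print(f"log loss, 10 features removed: {reduced:.3f}")
```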

user3664020

2 Answers


No algorithm can deal perfectly with a bad data representation. Some are better at it (like deep learning) and some are worse, but all ML methods suffer from a bad data representation. This is one of the motivations behind modern deep learning and the assumption that we should work directly on raw data instead of hand-crafted features (which can be both great and very, very misleading).

lejlot
  • I would change your example to L1 or Elastic-Net regularization. I'm not aware of any work showing that Deep Learning is particularly robust to noisy or irrelevant features. L1/Elastic-Net have been shown to have that kind of robustness though. – Raff.Edward Feb 27 '16 at 20:10
  • L1 is an aggressive approach that forces features to be removed, which in the case of valuable ones leads to high bias. There is no perfect example because, as I said, **all** techniques suffer from bad representation. – lejlot Feb 27 '16 at 20:26
  • Yes, but there is theoretical and empirical work showing that L1/Elastic are robust to truly irrelevant features. I'm not aware of that for deep learning. Just because they aren't perfect doesn't mean they aren't better. – Raff.Edward Feb 28 '16 at 19:03

No -

You are doing something wrong. Most likely the data you are evaluating is statistically different from your training data.

If the features you are talking about are not predictive w.r.t. the training data, they will be ignored by xgboost, so removing them won't impact anything. (Linear models don't have this luxury.)

Put up some reproducible code and we can dig deeper.
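In the meantime, here is one rough way to check that claim: fit the model and look at how much gain the suspect features actually receive in the trees. The dataset and feature names below are placeholders, not the real problem.

```python
# Rough check (placeholder data): do the suspect features actually get used?
import pandas as pd
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=150,
                           n_informative=20, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(150)])

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)

# Gain-based importance; features never chosen for a split are absent (0).
gain = model.get_booster().get_score(importance_type="gain")
suspects = [f"f{i}" for i in range(140, 150)]  # placeholder list
for f in suspects:
    print(f, gain.get(f, 0.0))
```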

T. Scharf
  • This statement is simply wrong on both counts. Xgboost, like every algorithm, will be impacted by noisy/irrelevant features. Linear models via Elastic-Net/L1 regularization are actually one of the few models that *can* ignore non-predictive features (though not perfectly). That doesn't mean the OP didn't make a mistake, but your statements are not correct. – Raff.Edward Feb 28 '16 at 19:07
  • Nah, it's very right. He didn't cut his error in half by removing 10 features. The reason L1 is used with linear models speaks precisely to the instability of linear models. Tree-based algorithms are particularly resilient to this. I'll send you $50 on PayPal if you can produce a data set that, when run through a sensible CV process, improves by 50% from removing 10 features. (Serious offer, always ready to learn.) – T. Scharf Feb 28 '16 at 19:29
  • Just to have fun... how about an example where noisy features that objectively do not add anything actually *improve* the negative log likelihood? All IID, so cross-validation is kosher (http://pastebin.com/QUSq4ZT6). No, that doesn't speak to the instability of linear models. Instability would be high variance, which is not something I would use to describe a linear model. Tree models are decently resilient to noisy/irrelevant features, but they are not immune. It all depends on the data. – Raff.Edward Feb 28 '16 at 22:19
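To make the L1 point in these comments concrete, here is a minimal sketch (synthetic data, all settings illustrative, not taken from the linked pastebin) showing how an L1 penalty tends to drive the coefficients of purely irrelevant features to exactly zero:

```python
# Minimal illustration of L1 sparsity on irrelevant features (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 10 informative features (columns 0-9) plus 40 pure-noise features.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           n_redundant=0, shuffle=False, random_state=0)
X = StandardScaler().fit_transform(X)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

coef = clf.coef_.ravel()
print("zero coefficients among informative features:", np.sum(coef[:10] == 0))
print("zero coefficients among noise features:      ", np.sum(coef[10:] == 0))
```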