SOME BACKGROUND

I am training a Random Forest regressor to predict crop yield. Some of my predictor variables apply only to some cases. For example, I have a variable denoting the number of rows, which only applies to crops grown in a polytunnel. If the crops are grown in a glasshouse, the number of rows does not apply, so it is left as a null value. I also have another variable that denotes whether the crop is grown in a polytunnel or a glasshouse.

THE PROBLEM

As Random Forest does not handle missing values, is there a strategy for dealing with variables that are null because they do not apply? Tutorials and papers on the topic suggest imputing the values, but in the scenarios they consider, the variables still apply and are missing because of some external factor (e.g. rich people don't generally want to reveal their salaries).

Bodwin
  • Yes, the best way to approach the problem is to give those cases a special value. For example, if the number of rows for the polytunnel crops ranges over [0,100], give all the glasshouse samples -1. What should then happen is that the tree uses the polytunnel/glasshouse variable to split the data. The polytunnel data will then be evaluated according to the number of rows, while the number of rows will be ignored for the glasshouse data since it is constant. – Roberto Nov 13 '18 at 13:54
  • Thank you for the answer - I have now applied your method to my data. My only worry is whether it will actually split on glasshouse/polytunnel - for all I know, random forest might decide to use number of rows first, in which case the -1 fill values will have interesting consequences. I recognise this depends on the underlying data, so as long as I am taking the best approach in the current circumstances, I am happy! – Bodwin Nov 14 '18 at 08:49
  • That is a fair point. I suggest you check what happens by plotting the tree structure. If you have a small dataset, you could compute the entropy/Gini values to check manually what happens. I will post the comment as an answer. – Roberto Nov 14 '18 at 08:52
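The check suggested in the last comment could be sketched roughly as follows, assuming scikit-learn is the library in use (the thread does not say). The data is synthetic and the column names `is_polytunnel` and `n_rows` are made up for illustration. The sketch prints one tree of the forest as text and counts which feature each tree splits on at its root.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic stand-in data (an assumption, not the asker's dataset).
rng = np.random.default_rng(0)
n = 400
is_polytunnel = rng.integers(0, 2, size=n)         # 1 = polytunnel, 0 = glasshouse
rows = rng.integers(0, 101, size=n).astype(float)  # valid range [0, 100]
n_rows = np.where(is_polytunnel == 1, rows, -1.0)  # -1 where rows do not apply
y = np.where(is_polytunnel == 1, 0.5 * rows, 30.0) + rng.normal(0, 1, size=n)

X = np.column_stack([is_polytunnel, n_rows])
feature_names = ["is_polytunnel", "n_rows"]

# Shallow trees keep the printed structure readable.
forest = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X, y)

# Text dump of the first tree in the forest.
tree_text = export_text(forest.estimators_[0], feature_names=feature_names)
print(tree_text)

# Which feature does each tree use for its first (root) split?
# tree_.feature holds a negative value for leaf nodes, so those are skipped.
root_counts = {name: 0 for name in feature_names}
for est in forest.estimators_:
    root_idx = est.tree_.feature[0]
    if root_idx >= 0:
        root_counts[feature_names[root_idx]] += 1
print(root_counts)
```

If many trees split on `n_rows` at the root, it is worth looking at the threshold: a cut just above -1 separates the glasshouse samples in the same way the polytunnel/glasshouse variable would.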

1 Answer

The best way to approach the problem is to give those cases a special value.

For example, if the number of rows for the polytunnel crops ranges over [0,100], give all the glasshouse samples -1.

What should then happen is that the tree uses the polytunnel/glasshouse variable to split the data. The polytunnel data will then be evaluated according to the number of rows, while the number of rows will be ignored for the glasshouse data, since it is constant there.
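A minimal sketch of this approach, assuming scikit-learn and NumPy (neither is named in the question). The data is synthetic, and the names `is_polytunnel`, `n_rows` and `SENTINEL` are invented for the example. The only step that matters is replacing the null row counts with a value outside the valid range before fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

SENTINEL = -1.0  # lies outside the valid range [0, 100] for the number of rows

# Synthetic stand-in data (an assumption, not the asker's dataset).
rng = np.random.default_rng(0)
n = 400
is_polytunnel = rng.integers(0, 2, size=n)  # 1 = polytunnel, 0 = glasshouse
rows = rng.integers(0, 101, size=n).astype(float)
n_rows = np.where(is_polytunnel == 1, rows, np.nan)  # null where not applicable

# Made-up yield: depends on rows in polytunnels, constant in glasshouses.
y = np.where(is_polytunnel == 1, 0.5 * rows, 30.0) + rng.normal(0, 1, size=n)

# Replace "not applicable" nulls with the sentinel value.
n_rows_filled = np.where(np.isnan(n_rows), SENTINEL, n_rows)
X = np.column_stack([is_polytunnel, n_rows_filled])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# One glasshouse sample and one polytunnel sample with 50 rows.
preds = model.predict([[0, SENTINEL], [1, 50]])
print(preds)
```

Because every glasshouse sample has the same `n_rows` value, no split on `n_rows` can separate glasshouse samples from one another, which is why the variable is effectively ignored within that group.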

Roberto