11

I have a dataset where I find that the dependent (target) variable has a skewed distribution - i.e. there are a few very large values and a long tail.

When I run the regression tree, one end node is created for the large-valued observations and one end node is created for the majority of the other observations.

Would it be OK to log-transform the dependent (target) variable and use it for regression tree analysis? When I tried this, I got a different set of nodes and splits, with what seems to be a more even distribution of observations in each bucket. With the log transformation, the R-squared value for predicted vs. observed is also quite good. In other words, I seem to get better testing and validation performance with the log transformation. I just want to make sure that log transformation is an accepted way to fit a regression tree when the dependent variable has a skewed distribution.

Thanks!

MichaelChirico
airjordan707
  • 2
    Sandeep's answer is correct. To be clear, you cannot compare the performance metrics of the two models. So just because your R-squared has gone up does not mean it's a better model. To compare apples-to-apples, you'd need to transform one of the prediction sets into the same scale as the other. So if you tune a model with the log-transformed target variable, you'll need to map the predictions back onto the original scale, using exp(), and then compare the metrics. – DangerousDave Jun 12 '18 at 12:31

1 Answer

14

Yes. It is completely fine to apply a log transformation to the target variable when it has a skewed distribution. That said, you need to apply the inverse function (the exponential) to the predicted values to get predictions on the original scale of the target.

Moreover, you have found that the transformation gives you a better R-squared. I am assuming you computed R-squared after inverting the log with the exponential function, so that it is measured on the original scale.
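To illustrate why the scale matters, here is a minimal sketch in plain Python. The observed values and the log-scale predictions are made up for illustration (they do not come from the question's data), and `r_squared` is a small helper written for this example rather than a library function. It computes R-squared once on the log scale and once after mapping the predictions back with `exp()`.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical skewed target: a few small values and one very large one.
observed = [1.0, 2.0, 3.0, 5.0, 8.0, 200.0]

# Hypothetical predictions from a model trained on log(observed).
log_predictions = [0.1, 0.6, 1.2, 1.5, 2.2, 4.6]

# R-squared measured on the log scale (what the log-target model optimises).
r2_log_scale = r_squared([math.log(y) for y in observed], log_predictions)

# Map predictions back to the original scale, then measure R-squared there.
predictions = [math.exp(p) for p in log_predictions]
r2_original_scale = r_squared(observed, predictions)

print("R-squared on log scale:     ", round(r2_log_scale, 3))
print("R-squared on original scale:", round(r2_original_scale, 3))
```

With these made-up numbers, the log-scale R-squared is noticeably higher than the original-scale one, because a modest error on the log scale for the largest observation becomes a large error after exponentiation. Only the original-scale figure can be compared with the R-squared of a tree fitted to the untransformed target.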

For more details, please refer to the Wikipedia article on data transformation.

Note that if your training data contains any zero or negative target values, the log transformation cannot be applied directly. You might have to apply some other function that accepts such values.
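As one possible choice (an assumption for illustration, not the only option), a signed `log1p` transform is defined for zero and negative values and has an exact inverse. The function names `signed_log1p` and `inverse_signed_log1p` and the sample targets below are hypothetical; only `math.log1p`, `math.expm1`, and `math.copysign` are standard library calls.

```python
import math

def signed_log1p(y):
    """sign(y) * log(1 + |y|): defined for zero and negative values."""
    return math.copysign(math.log1p(abs(y)), y)

def inverse_signed_log1p(z):
    """Inverse of signed_log1p: sign(z) * (exp(|z|) - 1)."""
    return math.copysign(math.expm1(abs(z)), z)

# Hypothetical targets containing negative, zero, and very large values.
targets = [-250.0, -3.0, 0.0, 0.5, 40.0, 10000.0]

# Transform before fitting the tree ...
transformed = [signed_log1p(y) for y in targets]

# ... and invert the model's predictions afterwards (here the transformed
# targets stand in for predictions, to show the round trip).
recovered = [inverse_signed_log1p(z) for z in transformed]

print(transformed)
print(recovered)
```

The transform is monotonic, so it preserves the ordering of the targets while compressing the long tail. If the targets are non-negative and only zeros are the problem, plain `math.log1p` with `math.expm1` as the inverse would be enough.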

Sandeep