A long time ago, I participated in this Kaggle competition: https://www.kaggle.com/competitions/bike-sharing-demand/data.
Check the code on Kaggle: https://www.kaggle.com/code/tchamna/bike-ride-sharing-prediction-xgboost-final
Problem statement: You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
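The train/test split described above can be sketched in a few lines of pandas. This is a minimal illustration with a synthetic hourly index, not the competition's actual data loading code; the column names are assumptions.

```python
import pandas as pd

# Synthetic hourly timestamps standing in for the competition data.
rng = pd.date_range("2011-01-01", "2011-02-28 23:00", freq="h")
df = pd.DataFrame({"datetime": rng})

# Train = days 1-19 of each month, test = day 20 through month's end.
train = df[df["datetime"].dt.day <= 19]
test = df[df["datetime"].dt.day >= 20]

assert len(train) + len(test) == len(df)
```

Splitting on the day-of-month like this mirrors how the public leaderboard is scored: the model never sees any hour from the 20th onward during training.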
Before anything else, I would like to let you know that I know decision-tree-based models do not require any feature scaling to do their job. I was just curious to see its effect on the convergence of the algorithm.
When I trained on the original/unscaled data, the RMSE on the training set stalled at around 41 with 100 trees. Increasing the number of trees past this point kept decreasing the training RMSE, but had no effect on the validation RMSE. So, no point adding more trees!
The max_depth of the XGBoost model was set to 8.
With the data scaled using log(1+x) [to avoid log(0)], the RMSE quickly converged to 0.106 on the training set and 0.31573 on the validation set, with only 50 trees!
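One caveat worth making explicit: an RMSE of 0.106 on the log scale is not comparable to an RMSE of 41 on the raw count scale, because log(1+x) compresses errors on large counts. A quick numpy check with made-up numbers (illustrative only, not the competition data) shows the effect:

```python
import numpy as np

# Synthetic rental counts and predictions (illustrative numbers only).
y_true = np.array([10.0, 100.0, 500.0])
y_pred = np.array([12.0, 120.0, 450.0])

# RMSE on the raw scale is dominated by errors on the largest counts.
rmse_raw = np.sqrt(np.mean((y_true - y_pred) ** 2))

# RMSE on the log1p scale is far smaller for the same predictions.
rmse_log = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

print(rmse_raw, rmse_log)
```

So a drop from 41 to 0.106 partly reflects the change of scale, not only better fitting.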
I was so happy with this fast convergence. However (and it is a big however), I was shocked when I realized that the model trained on the scaled data was not giving good predictions on the test inputs [note that I converted the predictions back to the original scale using the exponential function].
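One thing worth double-checking on the back-transform: the exact inverse of log(1+x) is exp(x) − 1 (numpy's `expm1`), not exp(x) alone; using plain `exp` shifts every back-converted value up by 1. A quick sanity check, assuming numpy (this is a generic illustration, not the notebook's actual code):

```python
import numpy as np

y = np.array([0.0, 3.0, 41.0, 500.0])  # original rental counts
z = np.log1p(y)                        # log(1 + y), the scaled target

back_exact = np.expm1(z)               # exp(z) - 1: exact inverse of log1p
back_plain = np.exp(z)                 # exp(z) alone: equals y + 1

assert np.allclose(back_exact, y)
assert np.allclose(back_plain, y + 1.0)
```

A constant +1 offset alone would not wreck the plot, but mixing up the forward and inverse transforms is a common source of exactly this "great validation RMSE, bad test predictions" symptom.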
See the plot of the results below.
Question: Have you ever noticed this type of issue with tree-based machine learning models, like Random Forest and XGBoost?
#xgboost #decisiontrees #machinelearning #featurescaling #featureengineering #datascience