
I am currently working on how machine learning models can be interpreted, and I found the function "pdp_plot" from the package PDPBox very useful for showing how the predicted outcome is affected by changes in explanatory variables. However, I haven't found a way to show all dummy variables, including the one dropped in the data pre-processing step.

In my initial dataset I had an explanatory variable called "Area" with 6 unique values: A, B, C, D, E, F. After creating dummy variables and dropping the first column, the dataset used for training my XGB model included Area_B, Area_C, Area_D, Area_E, Area_F.
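For reference, the encoding step described above can be sketched roughly as follows. This is a minimal illustration assuming pandas' `get_dummies` was used; the toy DataFrame `df` is hypothetical and only the "Area" column name comes from my data.

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values);
# only the "Area" column name is taken from the question.
df = pd.DataFrame({"Area": ["A", "B", "C", "D", "E", "F", "A"]})

# drop_first=True removes the first level (Area_A), which then becomes
# the implicit baseline: it is represented by all remaining dummies being 0.
encoded = pd.get_dummies(df, columns=["Area"], drop_first=True, dtype=int)

print(list(encoded.columns))

# Rows whose original Area was "A" have every dummy equal to 0.
print(encoded[df["Area"] == "A"])
```

So Area A is still present in the training data, just encoded as the all-zeros combination rather than as its own column.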

When using the 'pdp_isolate' and then 'pdp_plot' functions from PDPBox, it shows the case where dummy variable Area_B = 1, then the case where dummy variable Area_C = 1, then the case where dummy variable Area_D = 1, etc. but it does not show the results for the case where all these dummy variables = 0. Does someone know how to display this as well?
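To make clear what I am after, here is a rough sketch of computing the averages by hand, outside PDPBox, so that the all-zeros case appears next to the others. It is not PDPBox's API: the helper `area_partial_dependence`, the stand-in `toy_predict` function, and the column `x` are all hypothetical. With a real classifier, something like `lambda d: model.predict_proba(d)[:, 1]` could presumably be passed in place of `toy_predict`.

```python
import numpy as np
import pandas as pd


def area_partial_dependence(predict, X, dummy_cols, baseline_label):
    """Average prediction when every row is forced into each level in turn.

    The baseline level (the dropped dummy) is the case where all dummies are 0.
    """
    results = {}
    for level in [None] + list(dummy_cols):
        X_mod = X.copy()
        X_mod[dummy_cols] = 0          # reset every dummy to 0
        if level is not None:
            X_mod[level] = 1           # switch on exactly one level
        label = baseline_label if level is None else level
        results[label] = float(np.mean(predict(X_mod)))
    return results


# Hypothetical stand-in for a fitted model's prediction function.
def toy_predict(d):
    return 10.0 + 2.0 * d["Area_B"] - 3.0 * d["Area_C"] + 0.5 * d["x"]


# Hypothetical data: one extra feature "x" plus two of the dummy columns.
X = pd.DataFrame({"x": [0.0, 2.0, 4.0], "Area_B": [1, 0, 0], "Area_C": [0, 1, 0]})

pd_values = area_partial_dependence(
    toy_predict, X, ["Area_B", "Area_C"], baseline_label="Area_A (all dummies = 0)"
)
print(pd_values)
```

The resulting dictionary could then be drawn as a bar chart, but I would prefer to get the same thing directly from 'pdp_plot' if that is possible.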

Thanks a lot for your time. Hope the answer will also help the community. Please reach out if clarification is needed!

  • You are interpreting your machine learning model, and since `Area_A` is not included in the training data, the model never sees it and doesn't know about it, hence it does not show up. – techytushar Feb 04 '21 at 15:08
  • Thanks @techytushar. Yes, that's the point. So far, pdp_plot shows the case where dummy variable Area_B = 1, then the case where dummy variable Area_C = 1, then the case where dummy variable Area_D = 1, etc., but it does not show the results for the case where all these dummy variables = 0. Do you see what I mean? I will edit my question for clarification. – risk_hunter Feb 05 '21 at 07:32
