
I am working through the binary classification example below. I want to make sure some of my intuition is correct here, as there is not much documentation for the lime package yet, it being relatively new.

Output of the example

library(xgboost)
library(dplyr)
library(caret)
library(insuranceData) # example dataset https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf
library(lime) # Local Interpretable Model-Agnostic Explanations
set.seed(123)
data(dataCar)
mydb <- dataCar %>% select(clm, exposure, veh_value, veh_body,
                           veh_age, gender, area, agecat)

label_var <- "clm"  
offset_var <- "exposure"
feature_vars <- mydb %>% 
  select(-one_of(c(label_var, offset_var))) %>% 
  colnames()

# preparing data for xgboost (one-hot encoding of categorical (factor) data)
myformula <- paste0( "~", paste0( feature_vars, collapse = " + ") ) %>% as.formula()
dummyFier <- caret::dummyVars(myformula, data=mydb, fullRank = TRUE)
dummyVars.df <- predict(dummyFier,newdata = mydb)
mydb_dummy <- cbind(mydb %>% select(one_of(c(label_var, offset_var))), 
                    dummyVars.df)
rm(myformula, dummyFier, dummyVars.df)


feature_vars_dummy <-  mydb_dummy  %>% select(-one_of(c(label_var, offset_var))) %>% colnames()

xgbMatrix <- xgb.DMatrix(
  data = mydb_dummy %>% select(one_of(feature_vars_dummy)) %>% as.matrix, 
  label = mydb_dummy %>% pull(label_var),
  missing = NA) # missing values are flagged with NA rather than the string "NAN"


#model 2 : this works
myParam2 <- list(max_depth = 2,
                 eta = .01,
                 gamma = 0.001,
                 objective = 'binary:logistic',
                 eval_metric = "logloss")


booster2 <- xgb.train(
  params = myParam2, 
  data = xgbMatrix, 
  nrounds = 50)

explainer <- lime(mydb_dummy %>% select(one_of(feature_vars_dummy)), 
                  model = booster2)

explanation <- explain(mydb_dummy %>% select(one_of(feature_vars_dummy)) %>% head,
                       explainer,
                       n_labels = 2, ###### NOTE: I added this value, not sure if it should be '=1' or '=2' for binary classification.
                       n_features = 2)


plot_features(explanation)

The above code deals with insurance claims, which is a binary classification problem: claim or no claim.

Questions:

What is the function of n_labels? My own problem is a binary classification, so would n_labels correspond to the labels 0 and 1?
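
From the help page for explain() (this is my reading of the documentation, not something I have verified in the source), for classification you pass either labels (to name the class or classes you want explained) or n_labels (to explain the n most probable classes per case). For a binary problem something like the following should be the two ways of asking for one explanation per case:

# my assumption: name the class to explain explicitly...
explanation_a <- explain(mydb_dummy %>% select(one_of(feature_vars_dummy)) %>% head,
                         explainer,
                         labels = "1",   # whatever label name lime assigns to the "claim" class
                         n_features = 2)

# ...or let lime explain the single most probable class for each case
explanation_b <- explain(mydb_dummy %>% select(one_of(feature_vars_dummy)) %>% head,
                         explainer,
                         n_labels = 1,
                         n_features = 2)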

In the example here the author talks about malignant and benign as the labels. However, when I run the code on my own classification problem (I make sure that there are both 0 and 1 observations in the data I plot with plot_features(explanation)), the labels do not match the true value of the observation...

For my own problem I have the following plot:

I set n_labels = 1 (this plot comes from a different model than the code above, but it is still a classification problem).

Here in case 2, under the label header, I have a result of 1 - can I assume this is the binary classification prediction? However, when I output the actual true values of the binary outcome I get 1 0 1 0 1 0, which I am reading as: the model predicted case 1 as 0 when in fact it was a 1, case 2 as 1 when in fact it was a 0, case 3 as 0 when in fact it was a 1, case 4 as 0 and it actually was a 0, etc. Is this incorrect? (By the way, I used an xgb model to make the predictions.)
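
To check my reading, my plan is to compare the model's own predictions with the true labels directly, rather than inferring them from the lime plot. A sketch using the example objects from the code above (for my own data I would swap in my own booster and matrix; the 0.5 cut-off is just an illustrative choice):

pred_prob  <- predict(booster2, xgbMatrix)      # predicted probabilities from binary:logistic
pred_class <- as.integer(pred_prob > 0.5)       # illustrative 0.5 threshold
actual     <- mydb_dummy %>% pull(label_var)

head(data.frame(pred_prob, pred_class, actual)) # side-by-side for the first few cases
table(predicted = pred_class, actual = actual)  # quick confusion table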

Secondly, all the 0 cases in the plot (cases 1, 3, 4, 5, 6) have similar characteristics, whereas case 2 is different and other variables/features affect the model. I only plot 4 variables from the model, and again I do not know whether they are chosen at random or by some importance score.
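
As far as I can tell from the explain() documentation, the plotted features are not chosen at random: there is a feature_select argument that controls how the n_features features are picked (again my reading of the docs, not verified), e.g.:

# make the selection strategy explicit instead of relying on "auto"
explanation <- explain(mydb_dummy %>% select(one_of(feature_vars_dummy)) %>% head,
                       explainer,
                       n_labels = 1,
                       n_features = 4,
                       feature_select = "highest_weights") # or "forward_selection", "lasso_path", "tree"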

I quote the analysis from Understanding lime here:

In this overview it is clear to see how case 195 and 416 behave alike, while the third benign case (7) has an unusual large bare nuclei which are detracting from its status as benign without affecting the final prediction (indicating that the values of its other features are making up for this odd one). To no surprise it is clear that high values in the measurements are indicative of a malignant tumor.

If somebody could give me some sort of intuition/analysis of the above plot, it would be a great step in the right direction for me.

user113156
  • Check the `model_r2` column in the output of `explain`; if it is low, then the local model fit is not very good. You can try changing the feature selection type and the kernel width to improve it. If it is still bad, then lime is not the right choice for the data in question. When running `xgboost` it is perhaps better to use `xgboostExplainer`, because it was designed to extract the model that was built and explain its reasoning, whereas `lime` builds its own model, making it applicable to many modelling techniques but certainly not as good as a dedicated explainer. – missuse Apr 09 '18 at 06:12
  • Thanks for the comment! I have gone through many examples where the LIME package is used, and the `model_r2` is always rather low... for example, in this post ( http://www.business-science.io/business/2017/09/18/hr_employee_attrition.html ) the `model_r2` sits at around `0.22`, `0.23`, `0.24` for some variables. On my dataset I obtain an `R2` of around `0.40`. I have not yet changed the feature selection or modified the kernel width (but I will do so). In your experience, what would be a suitable `model_r2`? – user113156 Apr 09 '18 at 21:43
  • I do not have much experience with lime. I messed around a bit with the examples and tried it on a model I am working on (also xgboost), and I was not satisfied, simply because the class labels from lime did not match the predictions obtained with xgboost (just like in your case). So I settled on using `xgboostExplainer`, which gives a nice breakdown of the prediction, and partial dependence plots from `mlr`. – missuse Apr 11 '18 at 09:15
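
Following up on the `model_r2` suggestion in the comments, a minimal sketch of the kind of check and tweak being described (argument names as documented for explain(); the kernel_width value is purely illustrative):

# inspect the local surrogate fit quality per case / label
explanation %>% 
  select(case, label, model_r2) %>% 
  distinct()

# re-run with a different feature selection strategy and kernel width
explanation_kw <- explain(mydb_dummy %>% select(one_of(feature_vars_dummy)) %>% head,
                          explainer,
                          n_labels = 1,
                          n_features = 4,
                          feature_select = "lasso_path",
                          kernel_width = 0.5)  # illustrative value only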

1 Answer


To gain some intuition on the method and its interpretation, I wrote a blog post where I rebuild LIME with a similar method (with fully documented code): Explainable AI (XAI)… Explained! Or: How to whiten any Black Box with LIME

The main point is:

[LIME] has its equivalent in calculus: if you zoom in deep enough you can build most (even very complicated) functions out of linear building blocks. This is what LIME basically does!

The lengths of the bars in the feature plots represent the sizes of the coefficients of the linear model!
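
As a rough illustration of that idea (not the actual lime implementation, just the same principle: perturb around one observation, weight the perturbations by their proximity to it, fit a weighted linear model and read off its coefficients):

# toy sketch of the LIME idea: a local weighted linear surrogate around one observation
set.seed(1)
f  <- function(x1, x2) plogis(3 * sin(x1) + x2^2)  # stand-in "black box" returning probabilities
x0 <- c(x1 = 0.5, x2 = -1)                         # the single case we want to explain

# perturb around x0 and ask the black box for predictions
perturbed <- data.frame(x1 = rnorm(500, x0["x1"], 1),
                        x2 = rnorm(500, x0["x2"], 1))
perturbed$pred <- f(perturbed$x1, perturbed$x2)

# weight each perturbed point by its closeness to x0 (an exponential kernel)
w <- exp(-rowSums(sweep(perturbed[, c("x1", "x2")], 2, x0)^2) / 0.5)

# the local linear surrogate: its coefficients are what the bar lengths visualise
local_fit <- lm(pred ~ x1 + x2, data = perturbed, weights = w)
coef(local_fit)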

vonjd