
I am working on local explainability using LIME plots. Before building the model, I encode some of the categorical variables.

Sample Data and code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.DataFrame({'customer_id' : np.arange(1,21),
                  'gender' : np.random.choice(['male','female'], 20),
                  'age' : np.random.randint(19,50, 20),
                  'salary' : np.random.randint(20000,95000, 20),
                  'purchased' : np.random.choice([0,1], 20, p = [.8,.2])})

Preprocessing:

df['gender'] = df['gender'].map({'female' : 0, 'male' : 1})

df['age'] = df['age'].map(lambda x : 'young' if x<=35 else 'middle aged')

df['age'] = df['age'].map({'young' : 0, 'middle aged' : 1})

bins = [0, df['salary'].quantile(q=.33),df['salary'].quantile(q=.66),df['salary'].quantile(q=1)+1]
labels = ['low salary', 'medium salary', 'high salary']
df['salary'] = pd.cut(df['salary'], bins = bins, labels=labels)

from sklearn import preprocessing
l_encoder={}
label_encoder = preprocessing.LabelEncoder()
df['salary']= label_encoder.fit_transform(df['salary'])
df

    customer_id gender  age salary  purchased
0   1           0       0   1       0
1   2           0       0   0       0
2   3           0       1   2       0
3   4           1       0   0       0
4   5           1       1   2       0
5   6           0       1   1       0
6   7           1       0   2       0
7   8           1       1   0       0
8   9           1       1   1       0
9   10          1       0   0       0
10  11          0       1   0       0
11  12          0       0   1       0
12  13          1       1   1       0
13  14          1       1   1       0
14  15          1       1   2       1
15  16          1       1   0       0
16  17          1       1   1       0
17  18          0       0   0       0
18  19          0       0   2       0
19  20          0       0   2       0


# input
x = df.iloc[:, :-1]
  
# output
y = df.iloc[:, 4]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)

Separating the customer_id column:

X_train_cust = X_train.pop('customer_id')
X_test_cust = X_test.pop('customer_id')

Fitting a logistic regression model:

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

Building a lime chart:

import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                feature_names=X_train.columns,
                                verbose=True, mode = 'classification')

exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)

exp.as_pyplot_figure()


The LIME chart displays the encoded values of the features/columns, but I need the original values. For example, where the chart shows 0 for gender, I need it to display female. Could someone please let me know how to fix this?

Karthik S

1 Answer


You can use:

# Your direct mapping dictionary
# (LabelEncoder assigns codes in sorted label order, so 'high salary' is 0)
dmap = {'gender': {'female' : 0, 'male' : 1},
        'age': {'young' : 0, 'middle aged' : 1},
        'salary': {'high salary': 0, 'low salary': 1, 'medium salary': 2}}

# Reverse mapping dictionary (not used here)
rmap = {col: {v: k for k, v in dm.items()} for col, dm in dmap.items()}

# Categorical names, col0->gender, col1->age, col2->salary
cmap = {c: list(d.keys()) for c, d in enumerate(dmap.values())}


# Now use
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                feature_names=X_train.columns,
                                categorical_features=[0, 1, 2],  # <- 3 first columns
                                categorical_names=cmap,  # <- int to string
                                verbose=True, mode = 'classification')

exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)

exp.as_pyplot_figure()


Read this tutorial
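The name lists passed as `categorical_names` only label the chart correctly if their order matches the integer codes produced during preprocessing. As a minimal sketch, assuming the salary column was encoded with scikit-learn's `LabelEncoder` as in the question, the names can be read from the fitted encoder instead of being typed by hand. The sample array and the column position `2` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative sample using the three salary labels from the question.
salary = np.array(['low salary', 'high salary', 'medium salary', 'low salary'])

encoder = LabelEncoder()
codes = encoder.fit_transform(salary)

# classes_ is sorted; the position of each label in it is its integer code.
salary_names = encoder.classes_.tolist()
print(salary_names)
print(codes)

# Entry for categorical_names, assuming salary is column 2 of X_train.
categorical_names = {2: salary_names}
```

Because the list comes from the encoder itself, it stays in step with the codes even when the sorted order is not the intuitive low/medium/high one.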

Corralien
  • I am getting this error with my actual data: `IndexError: list index out of range`. I have 19 columns in total, columns 4 to 19, I've encoded using: `df[col].map({v:k for k,v in dict(enumerate(df[col].unique())).items()})`. – Karthik S Feb 06 '23 at 12:18
  • This is my code: `explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train), feature_names=X_train.columns, categorical_features=[4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], # <- 3 first columns categorical_names=cmap, # <- int to string verbose=True, mode = 'classification')` – Karthik S Feb 06 '23 at 12:19
  • The `cmap` keys should match `categorical_features`. If your first categorical column is 4, the first key of `cmap` has to be 4. – Corralien Feb 06 '23 at 12:31
  • So the `categorical_features` list starts from column number 4, and `categorical_names` must also start from column number 4? If so, could you please let me know how to do that? – Karthik S Feb 06 '23 at 12:33
  • I think so. Anyway, that's what I understood. `enumerate(..., 4)...` – Corralien Feb 06 '23 at 12:35
  • Could you please let me know what this does: `{c: list(d.keys()) for c, d in enumerate(dmap.values())}` – Karthik S Feb 06 '23 at 13:52
  • It converts, for example, `'age': {'young' : 0, 'middle aged' : 1}` to `1: ['young', 'middle aged']`. Why 1? Because `age` is the second column of `X_train`. You have to adapt this to your real dataframe. – Corralien Feb 06 '23 at 14:15
  • Final question: if `1` is the column index, how does `1: ['young', 'middle aged']` correctly map each string to the appropriate code? – Karthik S Feb 06 '23 at 14:35
  • Yes, it's your column index. If you handle the mapping correctly when you encode your labels, it should be right. You can also use `pd.factorize` to encode labels as numeric values. – Corralien Feb 06 '23 at 14:40
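The points raised in the comments (keys of `categorical_names` must be the real column positions, and the order of each name list must follow the codes) can be sketched together with `pd.factorize`. This is only an illustration under assumptions: the frame `X`, its column names, and `cat_cols` are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical frame: two numeric columns followed by two categorical ones.
X = pd.DataFrame({
    'income': [30, 45, 60, 52],
    'tenure': [1, 4, 2, 7],
    'gender': ['male', 'female', 'female', 'male'],
    'age': ['young', 'middle aged', 'young', 'young'],
})
cat_cols = ['gender', 'age']  # columns to encode (assumed)

categorical_features = []  # positional indices of the categorical columns
categorical_names = {}     # column index -> list of names, ordered by code

for col in cat_cols:
    # factorize numbers the labels in order of first appearance;
    # uniques[i] is the label that received code i.
    codes, uniques = pd.factorize(X[col])
    idx = X.columns.get_loc(col)
    X[col] = codes
    categorical_features.append(idx)
    categorical_names[idx] = uniques.tolist()

print(categorical_features)
print(categorical_names)
```

Both objects can then be passed to `LimeTabularExplainer` through its `categorical_features` and `categorical_names` arguments. Since the keys come from `X.columns.get_loc`, they stay correct when the categorical columns do not start at position 0.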