
I have data on a user's birthplace, specifically a city. Since I have a few thousand cities in my dataset, I looked for alternatives to OneHot encoding, because I didn't want to add thousands of columns to my dataset for a single feature. I found that BaseN encoding is a good alternative to OneHot, so I went with that. I encoded my data with base 4, so instead of a string column City I now have numeric columns City_0, City_1, etc.
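To make the encoding concrete, here is a minimal pure-Python sketch of the idea behind BaseN encoding. It assumes each distinct city gets an ordinal index starting at 1 (in sorted order) and that the index is then written as base-4 digits, one digit per City_# column. The helper names and the six sample cities are invented for illustration, and the real encoder may assign indices differently.

```python
def base_n_digits(index, base, width):
    """Write `index` as `width` base-`base` digits, most significant first."""
    digits = []
    for _ in range(width):
        digits.append(index % base)
        index //= base
    return digits[::-1]


def encode_cities(cities, base=4):
    # Assumption: ordinal indices start at 1 and follow sorted order.
    ordered = sorted(set(cities))
    index_of = {city: i for i, city in enumerate(ordered, start=1)}
    # Smallest number of digits that can represent the largest index.
    width = 1
    while base ** width <= len(ordered):
        width += 1
    return {city: base_n_digits(i, base, width) for city, i in index_of.items()}


# Hypothetical sample data, not taken from the question's dataset.
cities = ["Berlin", "Lima", "Oslo", "Paris", "Rome", "Tokyo"]
codes = encode_cities(cities)  # e.g. codes["Paris"] == [1, 0]

# Cities that share the value 1 in the second digit (the City_1 column).
share_digit = sorted(c for c, d in codes.items() if d[1] == 1)
```

In this sketch `share_digit` contains Berlin and Rome, two cities grouped together only because of where they fall in the ordinal ordering. That is why a single City_# column is hard to read on its own.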

However, after modeling the dataset with a Random Forest Classifier, I have found that certain City_# variables are amongst the most important features. But how do I interpret this result? Since the cities have been encoded into 4 separate columns, how can I draw an actual conclusion (e.g. which cities impact my target variable the most)? Is there a method, or did I completely lose interpretability by encoding the cities this way?

desertnaut
lte__
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut Oct 03 '21 at 22:31

1 Answer


You could export your pipeline to the PMML data format using the SkLearn2PMML package. During conversion, the BaseN encoding is undone, so it becomes easy to see which cities are sent down which branch of each tree.

Conversion example:

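from category_encoders import BaseNEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# cat_cols / cont_cols are the lists of categorical and continuous column names;
# X and y are the training features and target. All are assumed to be defined already.
mapper = ColumnTransformer([
  ("cat", BaseNEncoder(base = 4), cat_cols),
  ("cont", "passthrough", cont_cols)
])
classifier = RandomForestClassifier()
pipeline = PMMLPipeline([
  ("mapper", mapper),
  ("classifier", classifier)
])
pipeline.fit(X, y)
# Carry the fitted feature importances over into the PMML document
pipeline.pmml_feature_importances_ = classifier.feature_importances_
# Conversion option; see the note on numeric = False below
pipeline.configure(numeric = True)
sklearn2pmml(pipeline, "MyInterpretablePipeline.pmml")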

If you set the conversion option to numeric = False instead, the BaseN encoding is completely undone, so that city names are embedded directly into the random forest's tree structure.

In any case, PMML is arguably the most human-friendly data format for persisting fitted ML pipelines. It is an XML-based format, so the files can be opened, viewed and edited in any text editor.
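Because the file is XML, it can also be inspected programmatically. The following is only a sketch: the embedded fragment is hand-written for illustration and is not actual SkLearn2PMML output (a real export is far larger and may be structured differently). It uses the standard PMML elements SimplePredicate and SimpleSetPredicate, and the function name `city_splits` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; NOT real SkLearn2PMML output.
PMML_SNIPPET = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <TreeModel functionName="classification">
    <Node id="1">
      <True/>
      <Node id="2" score="1">
        <SimpleSetPredicate field="City" booleanOperator="isIn">
          <Array type="string">Paris Rome</Array>
        </SimpleSetPredicate>
      </Node>
      <Node id="3" score="0">
        <SimplePredicate field="City" operator="equal" value="Oslo"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""


def local(tag):
    """Strip the XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def city_splits(xml_text, field="City"):
    """List (node score, city values) for every tree node that tests `field`."""
    root = ET.fromstring(xml_text)
    found = []
    for node in root.iter():
        if local(node.tag) != "Node":
            continue
        for child in node:
            if child.get("field") != field:
                continue
            name = local(child.tag)
            if name == "SimplePredicate":
                found.append((node.get("score"), [child.get("value")]))
            elif name == "SimpleSetPredicate":
                array = next(el for el in child if local(el.tag) == "Array")
                # Naive split: assumes values contain no spaces or quotes.
                found.append((node.get("score"), array.text.split()))
    return found


splits = city_splits(PMML_SNIPPET)
```

Each entry in `splits` pairs a node's score with the city names tested at that node. On a real export the same kind of traversal would show which cities end up on which side of a split, though multi-word city names would need proper handling of quoted array values.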

user1808924