sklearn serialize label encoder to disk for multiple categorical columns

Question

I have a model with several categorical features that need to be converted to numeric format. I am using a combination of LabelEncoder and OneHotEncoder to achieve this. Once in production, I need to apply the same encoding to new incoming data before the model can be used. I've saved the model and the encoders to disk using pickle. The problem is that a LabelEncoder keeps only the last set of classes (those of the last feature it was fitted on), so it can't be used to encode all the categorical features of the new data. To work around this, I am saving a separate LabelEncoder to disk for each categorical feature, but this does not seem to scale well, especially with a large number of categorical features.

What is the common practice for this situation? Is it possible to serialize and save just one encoder for all the categorical features to be used in production?

score 0 · Answer 1 · answered Oct 19 '20 at 01:28

If I understand your question correctly, you need to confirm a few things first.

The schema of the DataFrame or the payload needs to be confirmed in production.
Once the schema is confirmed, you can serialize all the encoders together. For example, you can store them in a dict, cate_encoders = {"feature_1": LabelEncoder(), "feature_2": LabelEncoder(), "feature_3": OneHotEncoder()}, and serialize the dict once the encoders are fitted.
For the model in production, you could create a preprocessor class (see the example below),
and then create the model (load it from disk or S3) and run the prediction.

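class MyPreprocessor:
    """Holds the fitted transformers loaded from disk and applies them to new data."""

    def __init__(self):
        self.cate_transform = None  # e.g. dict of column name -> fitted encoder
        self.num_transform = None   # fitted transformer for the numeric columns

    def load_transform(self, cat_trans, num_trans):
        # Store the already deserialized transformers on the instance.
        self.cate_transform = cat_trans
        self.num_transform = num_trans

    def transform(self):
        # Apply the stored transformers to the incoming data
        # (implementation omitted here).
        pass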
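
The dict-of-encoders idea above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the column names (`color`, `size`) and the sample values are hypothetical, plain Python lists stand in for DataFrame columns, and an in-memory buffer stands in for the file on disk or S3.

```python
import io
import pickle

from sklearn.preprocessing import LabelEncoder

# Hypothetical training data: column name -> list of category values.
train = {
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "L", "M"],
}

# Fit one LabelEncoder per column and keep them all in a single dict.
encoders = {col: LabelEncoder().fit(values) for col, values in train.items()}

# Serialize the whole dict as one object (the buffer stands in for a file).
buffer = io.BytesIO()
pickle.dump(encoders, buffer)

# In production: load the dict once, then encode each column with its own encoder.
buffer.seek(0)
loaded = pickle.load(buffer)

new_data = {"color": ["blue", "red"], "size": ["L", "S"]}
encoded = {
    col: loaded[col].transform(values).tolist()
    for col, values in new_data.items()
}
print(encoded)
```

Only one object is written to disk this way, no matter how many categorical columns there are. Note that `LabelEncoder.transform` raises a `ValueError` for categories it did not see during fitting, so production data may need a check for unseen values. It may also be worth looking at `OrdinalEncoder` and `OneHotEncoder`, which accept several columns at once and could replace the per-column encoders with a single fitted object.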
