
As you can see, I have a preprocessing function that does some conversion operations. It has several categorical variables, defined in categorical_cols, which I encode with LabelEncoder. My goal is to save the fitted LabelEncoders for later use. The encoding itself works fine, there is no problem (screenshot of the correctly encoded output),

but when I save the LabelEncoders like this and then try to use them in a different preprocessing function by loading them:

---- LabelEncoder Save Side ----

for column in categorical_cols:
    label_encoder = LabelEncoder()
    taken_df[column] = label_encoder.fit_transform(taken_df[column])
    label_encoders[column] = label_encoder

with open('label_encoders.pkl', 'wb') as file:
    pickle.dump(label_encoders, file)

---- End ----

---- LabelEncoder Load Side ----

categorical_cols = ['from_city', 'to_city', 'vehicle_type', 'trailer_type']

with open('label_encoders.pkl', 'rb') as file:
    label_encoders = pickle.load(file)

for column in categorical_cols:
    test_df[column] = label_encoders[column].fit_transform(test_df[column])

---- End ----
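
For completeness, here is a minimal sketch of the same loading code but calling transform instead of fit_transform (assuming test_df is the test DataFrame above and only contains labels that were present when the encoders were fitted); I'm not sure whether this is the intended way to reuse the saved encoders:

import pickle

categorical_cols = ['from_city', 'to_city', 'vehicle_type', 'trailer_type']

with open('label_encoders.pkl', 'rb') as file:
    label_encoders = pickle.load(file)

for column in categorical_cols:
    # transform() applies the mapping learned during fitting,
    # instead of learning a new one from test_df
    test_df[column] = label_encoders[column].transform(test_df[column])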

The loading code with fit_transform runs, but the output is different (screenshot of the differing encoded values),

even though everything is the same: the columns used are identical, and the test data is even selected from the original dataset to reproduce this issue. Therefore, my questions are:

  • Is it possible to save the encoders for multiple columns in one pickle file and use them this way, or should I save a separate pickle file for every column and use them separately?

  • Secondly, how can I solve this issue?

Here you can find my whole preprocessing function:

def preprocessed_data(taken_df):
    
    
    # Keep only the columns used by the model and clean up the weight column
    used_cols = [....]
    taken_df = taken_df[used_cols]
    taken_df["weight"] = taken_df["weight"].str.replace(",", ".")
    taken_df["weight"] = taken_df["weight"].astype(float)
    taken_df.dropna(inplace=True)
    
    # Dealing with datetime columns
    taken_df["offer_date"] = pd.to_datetime(taken_df["offer_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_load_date"] = pd.to_datetime(taken_df["cargo_load_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_delivery_date"] = pd.to_datetime(taken_df["cargo_delivery_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["vehicle_assignment_date"] = pd.to_datetime(taken_df["vehicle_assignment_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    
     
    # Normalize the vehicle type labels via regex replacement
    vehicle_types = {
        "(?i).*(Tir|Tır).*": "TIR",
        "(?i).*(Kamyon)": "Kamyon"
    }

    taken_df.loc[:, "vehicle_type"] = taken_df.loc[:, "vehicle_type"].replace(vehicle_types, regex=True)
      
    # Extract the categorical columns
    categorical_cols = ['from_city', 'to_city',"vehicle_type","trailer_type"]
    
    label_encoders = {}
    
    for column in categorical_cols:
        label_encoder = LabelEncoder()
        taken_df[column] = label_encoder.fit_transform(taken_df[column])
        label_encoders[column] = label_encoder
        
    with open('label_encoders.pkl', 'wb') as file:
        pickle.dump(label_encoders, file)

    # Factor weights
    weight_factor = 0.6
    delivery_time_factor = 0.4
    offer_date_factor = 0.2
    
    # Convert offer date as UNIX timestamp
    taken_df['offer_date'] = pd.to_datetime(taken_df['offer_date'])
    epoch = dt.datetime(1970, 1, 1, tzinfo=pytz.UTC)
    taken_df['unix_offer_date'] = (taken_df['offer_date'] - epoch).dt.total_seconds()
    
    # Convert delivery date as UNIX timestamp
    taken_df['cargo_delivery_date'] = pd.to_datetime(taken_df['cargo_delivery_date'])
    taken_df['unix_delivery_time'] = (taken_df['cargo_delivery_date'] - epoch).dt.total_seconds()
    
    # min max scaling for normalization
    scaler = MinMaxScaler()
    
    # normalizing the weight column
    taken_df['normalized_weight'] = scaler.fit_transform(taken_df['weight'].values.reshape(-1, 1))
    
    # normalization of UNIX timestamps
    taken_df['normalized_offer_date'] = scaler.fit_transform(taken_df['unix_offer_date'].values.reshape(-1, 1))
    taken_df['normalized_delivery_time'] = scaler.fit_transform(taken_df['unix_delivery_time'].values.reshape(-1, 1))
    
    with open('scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
    
    # Calculation of priority score
    taken_df['priority_score'] = (weight_factor * taken_df['normalized_weight']) + (offer_date_factor * taken_df['normalized_offer_date']) + (delivery_time_factor * taken_df['normalized_delivery_time'])
    
    

    return taken_df
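
For context, this is roughly how the function is called (a minimal sketch; offers.csv is just a placeholder for my actual data file):

import pandas as pd

# Placeholder load of the raw data; the real file name differs
df = pd.read_csv("offers.csv")

processed_df = preprocessed_data(df)
print(processed_df["priority_score"].head())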

I have also tried this way, but it didn't work either:

encoder = LabelEncoder()

for col in categorical_cols:
    taken_df[col] = encoder.fit_transform(taken_df[col])

with open('encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)
  • I think you should try to scope the question to a smaller problem. Also keep in mind this: This transformer should be used to encode target values, i.e. y, and not the input X. From sklearn docs https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html – Celius Stingher Jun 11 '23 at 20:10
  • Thanks for your feedback. So should I use one-hot encoding here, and if yes, how can I do the same process in my code? Thanks, regards @CeliusStingher – Zyxnon Jun 11 '23 at 20:17
  • Indeed, consider replacing the LabelEncoding with sklearn's one hot encoding, or pd.get_dummies() too. – Celius Stingher Jun 11 '23 at 20:20
  • I'll try, but I need to ask one more thing about this. Let's assume I have 50 different cities in my original df; using get_dummies will create 50 columns, but when I get only one row of input from the user it will create just 1 column. What should I do in order to process my user-input dataset like my original dataframe? Thanks, @CeliusStingher – Zyxnon Jun 11 '23 at 20:32
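
To make that last comment more concrete, here is a minimal sketch of one possible way to align a single one-hot encoded user-input row with the dummy columns produced on the training data (it assumes taken_df and categorical_cols from the function above; training_columns and user_input_df are placeholder names, not part of my code, and I'm not sure this is the right approach):

import pandas as pd

# Placeholder: the dummy column names produced when one-hot encoding the training data
training_columns = pd.get_dummies(taken_df[categorical_cols]).columns

# One-hot encode the single user-input row (user_input_df is a placeholder),
# then add any missing dummy columns filled with 0 and drop unseen ones
user_row = pd.get_dummies(user_input_df[categorical_cols])
user_row = user_row.reindex(columns=training_columns, fill_value=0)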
