As you can see, I have a preprocessing function that performs some conversion operations. I have some categorical variables, which I list in categorical_cols, and I encode them with LabelEncoder. My goal is to save the fitted LabelEncoder objects for later use. The encoding itself works fine, but the problem appears when I save the encoders like this and then load them in a different preprocessing function:
---- LabelEncoder Save Side ----
for column in categorical_cols:
    label_encoder = LabelEncoder()
    taken_df[column] = label_encoder.fit_transform(taken_df[column])
    label_encoders[column] = label_encoder
with open('label_encoders.pkl', 'wb') as file:
    pickle.dump(label_encoders, file)
---- End ----
---- LabelEncoder Load Side ----
categorical_cols = ['from_city', 'to_city',"vehicle_type","trailer_type"]
with open('label_encoders.pkl', 'rb') as file:
    label_encoders = pickle.load(file)
for column in categorical_cols:
    test_df[column] = label_encoders[column].fit_transform(test_df[column])
---- End ----
It runs without errors, but the encoded output is different from the original encoding, even though everything else is the same: the same columns are used, and the test data is even selected from the original dataset to reproduce this issue. Therefore, my questions are:
Is it possible to save the encoders for multiple columns in one file and use them this way, or should I save a separate pickle file for each column and load them separately?
Secondly, how can I solve this issue?
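To narrow this down, here is a minimal sketch of what I suspect is going on. The DataFrame and the city names are made up for illustration, and the dict is pickled in memory with pickle.dumps instead of a file. It compares transform on a loaded encoder with fit_transform on the same subset of rows. As far as I understand, fit_transform re-learns the mapping from whatever data it is given, which could explain the different output.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data; the city names are made up for illustration.
train_df = pd.DataFrame({
    "from_city": ["Ankara", "Izmir", "Bursa", "Izmir"],
    "to_city": ["Izmir", "Adana", "Ankara", "Bursa"],
})
categorical_cols = ["from_city", "to_city"]

# Fit one encoder per column and keep them all in a single dict.
label_encoders = {}
for column in categorical_cols:
    encoder = LabelEncoder()
    encoder.fit(train_df[column])
    label_encoders[column] = encoder

# Round-trip the whole dict through pickle (in memory here instead of a file).
loaded_encoders = pickle.loads(pickle.dumps(label_encoders))

# A subset of the original rows, similar to my test setup.
test_df = train_df.iloc[[1, 3]].reset_index(drop=True)

results = {}
for column in categorical_cols:
    # transform() reuses the mapping learned on the training data.
    reused = loaded_encoders[column].transform(test_df[column])
    # Fitting again re-learns the mapping from the subset only; calling
    # fit_transform() on the loaded encoder would discard its saved classes
    # in the same way.
    refit = LabelEncoder().fit_transform(test_df[column])
    results[column] = (reused.tolist(), refit.tolist())
    print(column, "transform:", reused.tolist(), "fit_transform:", refit.tolist())
```

In this sketch the single pickle file with a dict of encoders survives the round trip, and only the transform results match the codes assigned on the training data.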
Here is my whole preprocessing function:
def preprocessed_data(taken_df):
    used_cols = [....]
    taken_df = taken_df[used_cols]
    taken_df["weight"] = taken_df["weight"].str.replace(",",".")
    taken_df["weight"] = taken_df["weight"].astype(float)
    taken_df.dropna(inplace=True)
    # Dealing with datetime columns
    taken_df["offer_date"] = pd.to_datetime(taken_df["offer_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_load_date"] = pd.to_datetime(taken_df["cargo_load_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_delivery_date"] = pd.to_datetime(taken_df["cargo_delivery_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["vehicle_assignment_date"] = pd.to_datetime(taken_df["vehicle_assignment_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    vehicle_types = {
        "(?i).*(Tir|Tır).*":"TIR",
        "(?i).*(Kamyon)":"Kamyon"
    }
    taken_df.loc[:,"vehicle_type"] = taken_df.loc[:,"vehicle_type"].replace(vehicle_types,regex=True)
    # Extract the categorical columns
    categorical_cols = ['from_city', 'to_city',"vehicle_type","trailer_type"]
    label_encoders = {}
    for column in categorical_cols:
        label_encoder = LabelEncoder()
        taken_df[column] = label_encoder.fit_transform(taken_df[column])
        label_encoders[column] = label_encoder
    with open('label_encoders.pkl', 'wb') as file:
        pickle.dump(label_encoders, file)
    # Factor weights
    weight_factor = 0.6
    delivery_time_factor = 0.4
    offer_date_factor = 0.2
    # Convert offer date as UNIX timestamp
    taken_df['offer_date'] = pd.to_datetime(taken_df['offer_date'])
    epoch = dt.datetime(1970, 1, 1, tzinfo=pytz.UTC)
    taken_df['unix_offer_date'] = (taken_df['offer_date'] - epoch).dt.total_seconds()
    # Convert delivery date as UNIX timestamp
    taken_df['cargo_delivery_date'] = pd.to_datetime(taken_df['cargo_delivery_date'])
    taken_df['unix_delivery_time'] = (taken_df['cargo_delivery_date'] - epoch).dt.total_seconds()
    # min max scaling for normalization
    scaler = MinMaxScaler()
    # normalizing the weight column
    taken_df['normalized_weight'] = scaler.fit_transform(taken_df['weight'].values.reshape(-1, 1))
    # normalization of UNIX timestamps
    taken_df['normalized_offer_date'] = scaler.fit_transform(taken_df['unix_offer_date'].values.reshape(-1, 1))
    taken_df['normalized_delivery_time'] = scaler.fit_transform(taken_df['unix_delivery_time'].values.reshape(-1, 1))
    with open('scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
    # Calculation of priority score
    taken_df['priority_score'] = (weight_factor * taken_df['normalized_weight']) + (offer_date_factor * taken_df['normalized_offer_date']) + (delivery_time_factor * taken_df['normalized_delivery_time'])
    return taken_df
I have also tried it this way, with a single encoder shared by all columns, but it didn't work either:
encoder = LabelEncoder()
for col in categorical_cols:
    taken_df[col] = encoder.fit_transform(taken_df[col])
with open('encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)
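A small sketch of why I think this second attempt cannot work, again with made-up values: each call to fit_transform on the same LabelEncoder seems to overwrite what it learned before, so after the loop the encoder only knows the classes of the last column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-column frame; the values are made up for illustration.
df = pd.DataFrame({
    "from_city": ["Ankara", "Izmir"],
    "vehicle_type": ["TIR", "Kamyon"],
})

# One shared encoder, refit on every column in turn.
encoder = LabelEncoder()
for col in ["from_city", "vehicle_type"]:
    encoder.fit_transform(df[col])

# Only the classes of the last fitted column remain on the encoder.
remaining_classes = encoder.classes_.tolist()
print(remaining_classes)

# Encoding the first column with the shared encoder now fails,
# because its labels are unknown to the encoder.
try:
    encoder.transform(df["from_city"])
    from_city_ok = True
except ValueError:
    from_city_ok = False
print("from_city can still be encoded:", from_city_ok)
```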