
As you can see, I have a preprocessing function that does some conversion operations. It has several categorical variables, defined in categorical_cols, which I encode with LabelEncoder. My goal is to save the fitted LabelEncoders for later use. The encoding itself works fine, there is no problem (screenshot of the correctly encoded output),

but when I save the LabelEncoders like this and then try to use them in a different preprocessing function by loading them:

---- LabelEncoder Save Side ----

for column in categorical_cols:
    label_encoder = LabelEncoder()
    taken_df[column] = label_encoder.fit_transform(taken_df[column])
    label_encoders[column] = label_encoder

with open('label_encoders.pkl', 'wb') as file:
    pickle.dump(label_encoders, file)

---- End ----

---- LabelEncoder Load Side ----

categorical_cols = ['from_city', 'to_city', 'vehicle_type', 'trailer_type']

with open('label_encoders.pkl', 'rb') as file:
    label_encoders = pickle.load(file)

for column in categorical_cols:
    test_df[column] = label_encoders[column].fit_transform(test_df[column])

---- End ----
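
For completeness, here is a minimal sketch of the same loading code but calling transform instead of fit_transform (assuming test_df is the test DataFrame above and only contains labels that were present when the encoders were fitted); I'm not sure whether this is the intended way to reuse the saved encoders:

import pickle

categorical_cols = ['from_city', 'to_city', 'vehicle_type', 'trailer_type']

with open('label_encoders.pkl', 'rb') as file:
    label_encoders = pickle.load(file)

for column in categorical_cols:
    # transform() applies the mapping learned during fitting,
    # instead of learning a new one from test_df
    test_df[column] = label_encoders[column].transform(test_df[column])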

The loading code with fit_transform runs, but the output is different (screenshot of the differing encoded values),

even though everything is the same: the columns used are identical, and the test data is even selected from the original dataset to reproduce this issue. Therefore, my questions are:

  • Is it possible to save the encoders for multiple columns in one pickle file and use them this way, or should I save a separate pickle file for every column and use them separately?

  • Secondly, how can I solve this issue?

Here you can find my whole preprocessing function:

def preprocessed_data(taken_df):
    
    
    # Keep only the columns used by the model and clean up the weight column
    used_cols = [....]
    taken_df = taken_df[used_cols]
    taken_df["weight"] = taken_df["weight"].str.replace(",", ".")
    taken_df["weight"] = taken_df["weight"].astype(float)
    taken_df.dropna(inplace=True)
    
    # Dealing with datetime columns
    taken_df["offer_date"] = pd.to_datetime(taken_df["offer_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_load_date"] = pd.to_datetime(taken_df["cargo_load_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_delivery_date"] = pd.to_datetime(taken_df["cargo_delivery_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["vehicle_assignment_date"] = pd.to_datetime(taken_df["vehicle_assignment_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    
     
    # Normalize the vehicle type labels via regex replacement
    vehicle_types = {
        "(?i).*(Tir|Tır).*": "TIR",
        "(?i).*(Kamyon)": "Kamyon"
    }

    taken_df.loc[:, "vehicle_type"] = taken_df.loc[:, "vehicle_type"].replace(vehicle_types, regex=True)
      
    # Extract the categorical columns
    categorical_cols = ['from_city', 'to_city',"vehicle_type","trailer_type"]
    
    label_encoders = {}
    
    for column in categorical_cols:
        label_encoder = LabelEncoder()
        taken_df[column] = label_encoder.fit_transform(taken_df[column])
        label_encoders[column] = label_encoder
        
    with open('label_encoders.pkl', 'wb') as file:
        pickle.dump(label_encoders, file)

    # Factor weights
    weight_factor = 0.6
    delivery_time_factor = 0.4
    offer_date_factor = 0.2
    
    # Convert offer date as UNIX timestamp
    taken_df['offer_date'] = pd.to_datetime(taken_df['offer_date'])
    epoch = dt.datetime(1970, 1, 1, tzinfo=pytz.UTC)
    taken_df['unix_offer_date'] = (taken_df['offer_date'] - epoch).dt.total_seconds()
    
    # Convert delivery date as UNIX timestamp
    taken_df['cargo_delivery_date'] = pd.to_datetime(taken_df['cargo_delivery_date'])
    taken_df['unix_delivery_time'] = (taken_df['cargo_delivery_date'] - epoch).dt.total_seconds()
    
    # min max scaling for normalization
    scaler = MinMaxScaler()
    
    # normalizing the weight column
    taken_df['normalized_weight'] = scaler.fit_transform(taken_df['weight'].values.reshape(-1, 1))
    
    # normalization of UNIX timestamps
    taken_df['normalized_offer_date'] = scaler.fit_transform(taken_df['unix_offer_date'].values.reshape(-1, 1))
    taken_df['normalized_delivery_time'] = scaler.fit_transform(taken_df['unix_delivery_time'].values.reshape(-1, 1))
    
    with open('scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
    
    # Calculation of priority score
    taken_df['priority_score'] = (weight_factor * taken_df['normalized_weight']) + (offer_date_factor * taken_df['normalized_offer_date']) + (delivery_time_factor * taken_df['normalized_delivery_time'])
    
    

    return taken_df
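
For context, this is roughly how the function is called (a minimal sketch; offers.csv is just a placeholder for my actual data file):

import pandas as pd

# Placeholder load of the raw data; the real file name differs
df = pd.read_csv("offers.csv")

processed_df = preprocessed_data(df)
print(processed_df["priority_score"].head())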

I have also tried this way, but it didn't work either:

encoder = LabelEncoder()

for col in categorical_cols:
    taken_df[col] = encoder.fit_transform(taken_df[col])

with open('encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)
  • I think you should try to scope the question to a smaller problem. Also keep in mind this: This transformer should be used to encode target values, i.e. y, and not the input X. From sklearn docs https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html – Celius Stingher Jun 11 '23 at 20:10
  • Thanks for your feedback. So should I use one-hot encoding here, and if yes, how can I do the same process in my code? Thanks, regards @CeliusStingher – Zyxnon Jun 11 '23 at 20:17
  • Indeed, consider replacing the LabelEncoding with sklearn's one hot encoding, or pd.get_dummies() too. – Celius Stingher Jun 11 '23 at 20:20
  • I'll try, but I need to ask one more thing about this. Let's assume I have 50 different cities in my original df; using get_dummies will create 50 columns, but when I get only one row of input from the user it will create just 1 column. What should I do in order to process my user-input dataset like my original dataframe? Thanks, @CeliusStingher – Zyxnon Jun 11 '23 at 20:32
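
To make that last comment more concrete, here is a minimal sketch of one possible way to align a single one-hot encoded user-input row with the dummy columns produced on the training data (it assumes taken_df and categorical_cols from the function above; training_columns and user_input_df are placeholder names, not part of my code, and I'm not sure this is the right approach):

import pandas as pd

# Placeholder: the dummy column names produced when one-hot encoding the training data
training_columns = pd.get_dummies(taken_df[categorical_cols]).columns

# One-hot encode the single user-input row (user_input_df is a placeholder),
# then add any missing dummy columns filled with 0 and drop unseen ones
user_row = pd.get_dummies(user_input_df[categorical_cols])
user_row = user_row.reindex(columns=training_columns, fill_value=0)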
