
I am currently in the process of training a model using the LGBM algorithm, and the features I have include five categorical features and one textual feature.

For preprocessing, I apply CountVectorizer to the categorical features. The textual data is in Persian, so I preprocess it with the "hazm" library and then apply TF-IDF vectorization.
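For context, the layout described above (CountVectorizer on categorical columns, TF-IDF on the text column) can be sketched with a `ColumnTransformer`. This is only an illustration: the column names, the single categorical column, and the toy data are assumptions, not taken from the original code.

```python
# Sketch of the described preprocessing: CountVectorizer on a categorical
# column, TfidfVectorizer on a text column. Column names are illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

df = pd.DataFrame({
    "cat1": ["red", "blue", "red"],          # stand-in categorical feature
    "text": ["first doc", "second doc", "third one"],  # stand-in text feature
})

preprocess = ColumnTransformer([
    # Passing the column name as a string (not a list) hands the vectorizer
    # a 1-D sequence of strings, which is what it expects.
    ("cat", CountVectorizer(), "cat1"),
    ("txt", TfidfVectorizer(), "text"),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 3 rows; columns = cat vocabulary + text vocabulary
```

The fitted `preprocess` step could then feed an LGBM estimator inside a `Pipeline`.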

With SDK v1, the code ran in around 6 minutes. Since moving to SDK v2 and incorporating MLflow, the execution time has increased significantly. Because my computational resources are limited, I'm forced to cancel the run after 30 minutes.

At present, I'm unsure of the root cause of the issue, as I haven't altered the preprocessing methods, the model, or the data itself.

However, I have noticed that the execution time increases significantly when adding the textual feature to the model. Currently, I'm unsure which aspect I should investigate.

Preprocessing of the text feature:

from sklearn.base import BaseEstimator, TransformerMixin
import hazm
import re

class text_preprocess(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        text = X[self.columns]
        return text_transformer(text)

def text_transformer(text):
    normalizer = hazm.Normalizer()
    tokenizer = hazm.WordTokenizer(separate_emoji=True,
                                   replace_hashtags=True,
                                   join_verb_parts=False)
    lemmatizer = hazm.Lemmatizer()

    # Strips ASCII digits, commas, and Persian digits.
    pattern = r'[0-9,۰-۹]'

    text_list = []
    # Iterate over the values directly instead of text.iloc[i][0]:
    # assigning through chained indexing mutates the input frame and
    # triggers SettingWithCopyWarning.
    for value in text.iloc[:, 0]:
        # Covers None, NaN, and empty strings in one check.
        if not isinstance(value, str) or not value:
            text_list.append(" ")
            continue

        removed_numbers = re.sub(pattern, '', value)
        normal = normalizer.normalize(removed_numbers)
        tokens = tokenizer.tokenize(normal)
        lemmata = [lemmatizer.lemmatize(token) for token in tokens]
        text_list.append(' '.join(lemmata))

    return text_list
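Independent of hazm, the per-row pattern above can also be expressed as a single `Series.map` with one cleaning function, which avoids indexing row by row. The helper below uses a plain whitespace split as a stand-in for the normalize/tokenize/lemmatize steps, purely for illustration; `clean` and the sample data are not from the original code.

```python
import re
import pandas as pd

def clean(value):
    # None/NaN/empty -> single space, mirroring the original fallback.
    if not isinstance(value, str) or not value:
        return " "
    # Strip ASCII digits, commas, and Persian digits, as in the original pattern.
    no_digits = re.sub(r'[0-9,۰-۹]', '', value)
    # Stand-in for the hazm normalize/tokenize/lemmatize chain.
    return ' '.join(no_digits.split())

texts = pd.Series(["abc 123", None, ""])
print(texts.map(clean).tolist())  # -> ['abc', ' ', ' ']
```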

There's another important point: I've added log statements at several places in the code to track where the time is being spent. However, the code doesn't seem to reach the main section, so nothing gets logged in Azure ML.
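One way to make the log statements above more systematic is to wrap each stage in a small timer that writes to the job's stdout/driver log as soon as the stage finishes, so the slow step shows up even if later code never runs. This is a minimal sketch with an illustrative step name, not the original logging code:

```python
# Minimal per-step timing: each stage logs its duration on exit, so the
# last line printed points at the stage that stalls.
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, force=True)
log = logging.getLogger("timing")

@contextmanager
def timed(step):
    start = time.perf_counter()
    try:
        yield
    finally:
        log.info("%s took %.2f s", step, time.perf_counter() - start)

with timed("text preprocessing"):
    time.sleep(0.01)  # stand-in for text_transformer(...)
```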


