I am training a model with LightGBM (LGBM). The features are five categorical features and one text feature.
For preprocessing, I apply CountVectorizer to the categorical features. The text is in Persian, so I first preprocess it with the "hazm" library and then apply a TF-IDF vectorizer.
Previously, with SDK1, the code ran in about 6 minutes. Since moving to SDK2 and adding MLflow, the execution time has increased significantly. Because of limited computational resources, I have to cancel the run within 30 minutes.
I don't know the root cause, since I haven't changed the preprocessing, the model, or the data.
I have noticed, however, that the execution time increases significantly when I add the text feature to the model, and I'm not sure what to investigate next.
Preprocessing of the text feature:
import re

import hazm
from sklearn.base import BaseEstimator, TransformerMixin


class text_preprocess(BaseEstimator, TransformerMixin):
    """Selects the text column(s) and returns a list of cleaned strings."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X, y=None):
        # X[self.columns] is expected to be a DataFrame whose first column holds the text.
        text = X[self.columns]
        refined_text = text_transformer(text)
        return refined_text


def text_transformer(text):
    # Build the hazm objects once per call, not once per row.
    normalizer = hazm.Normalizer()
    tokenizer = hazm.WordTokenizer(separate_emoji=True, replace_hashtags=True, join_verb_parts=False)
    lemmatizer = hazm.Lemmatizer()
    # Removes ASCII digits, Persian digits, and commas.
    pattern = re.compile(r'[0-9,۰-۹]')
    text_list = []
    for i in range(len(text)):
        value = text.iloc[i, 0]  # positional access, avoids chained indexing
        if not value:
            # None or empty string: emit a blank placeholder without modifying the input.
            text_list.append(" ")
            continue
        removed_number = pattern.sub('', value)
        normal = normalizer.normalize(removed_number)
        tokens = tokenizer.tokenize(normal)
        lemmate = [lemmatizer.lemmatize(token) for token in tokens]
        text_list.append(' '.join(lemmate))
    return text_list
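One way to narrow down which step of the text preprocessing is slow is to time each stage separately on a small sample of rows. The following is only a minimal sketch: `time_stages` is a hypothetical helper, and the stages are plain string functions standing in for the hazm normalizer, tokenizer, and lemmatizer so that the example runs without hazm installed.

```python
import time
from collections import defaultdict


def time_stages(rows, stages):
    """Apply each (name, func) stage to every row and sum wall-clock seconds per stage."""
    totals = defaultdict(float)
    results = []
    for row in rows:
        value = row
        for name, func in stages:
            start = time.perf_counter()
            value = func(value)
            totals[name] += time.perf_counter() - start
        results.append(value)
    return results, dict(totals)


# Stand-in stages (assumption): in the real pipeline these would be
# normalizer.normalize, tokenizer.tokenize, and the per-token lemmatizer call.
stages = [
    ("normalize", str.strip),
    ("tokenize", str.split),
    ("lemmatize", lambda tokens: [t.lower() for t in tokens]),
    ("join", " ".join),
]

sample = ["  Hello World ", " Foo BAR baz "]
results, totals = time_stages(sample, stages)
for name, seconds in totals.items():
    print(f"{name}: {seconds:.6f}s")
```

Running the same helper on a few hundred real rows with the actual hazm callables should show whether one stage dominates, or whether the time is spent outside the text preprocessing entirely.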
One more important point: I added log statements at several places in the code to see where the time is being spent. However, the code does not seem to reach the main section, so nothing is logged on Azure ML.
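To check whether the script starts at all, one option is to emit a log line at the very top of the file, before any heavy import or data loading, and again after the imports. This is a minimal sketch using only the standard library. The `get_logger` helper and the logger name `train` are hypothetical, `import json` is a stand-in for a slow import such as hazm, and it assumes that the job's standard output is what shows up in the run logs.

```python
import logging
import sys
import time


def get_logger(name="train"):
    """Return a logger that writes to stdout; StreamHandler flushes after every record."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = get_logger()
logger.info("script started")  # emitted before any heavy import

start = time.perf_counter()
import json  # stand-in (assumption) for a slow import such as hazm or lightgbm
logger.info("imports finished in %.3fs", time.perf_counter() - start)


def main():
    logger.info("entered main")
    return "done"


if __name__ == "__main__":
    main()
```

If "script started" appears but "entered main" never does, the time is being spent in the imports or other module-level code, not in the training logic.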