Encoding text in ML classifier

Question

I am trying to build an ML model, but I am having difficulty understanding where to apply the encoding. Below are the steps and functions needed to replicate the process I have been following.

First I split the dataset into train and test:

# Import the required packages
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
import string
from nltk.corpus import stopwords
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.utils import resample
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
# Split into training and test sets

# Testing Count Vectorizer

X = df[['Text']] 
y = df['Label']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1)

Now I apply the (under) sampling:

# Separating classes
spam = training_set[training_set.Label == 1]
not_spam = training_set[training_set.Label == 0]

# Undersampling the majority
undersample = resample(not_spam, 
                       replace=True, 
                       n_samples=len(spam), #set the number of samples to equal the number of the minority class
                       random_state=40)
# Returning to new training set
undersample_train = pd.concat([spam, undersample])

And I apply the algorithm chosen:

full_result = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])

X, y = BOW(undersample_train)
full_result = full_result.append(training_naive(X_train, X_test, y_train, y_test, 'Count Vectorize'), ignore_index = True)

where BOW is defined as follows

def BOW(data):
    
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)

    count_vectorizer = CountVectorizer(analyzer=fun)
    count_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    
    X = count_vectorizer.transform(list_corpus)
    
    return X, list_labels

basic_preprocessing is defined as follows:

def basic_preprocessing(df):
    
    df_temp = df.copy(deep = True)
    df_temp = df_temp.rename(index = str, columns = {'Clean_Titles_2': 'Text'})
    df_temp.loc[:, 'Text'] = [text_prepare(x) for x in df_temp['Text'].values]
    
    #le = LabelEncoder()
    #le.fit(df_temp['medical_specialty'])
    #df_temp.loc[:, 'class_label'] = le.transform(df_temp['medical_specialty'])
    
    tokenizer = RegexpTokenizer(r'\w+')
    df_temp["Tokens"] = df_temp["Text"].apply(tokenizer.tokenize)
    
    return df_temp

where text_prepare is:

def text_prepare(text):

    REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
    BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
    STOPWORDS = set(stopwords.words('english'))
    
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub('', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    words = text.split()
    i = 0
    while i < len(words):
        if words[i] in STOPWORDS:
            words.pop(i)
        else:
            i += 1
    text = ' '.join(map(str, words))# delete stopwords from text
    
    return text

and

def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    
    clf = MultinomialNB() # Multinomial Naive Bayes
    clf.fit(X_train_naive, y_train_naive)

    res = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])
    
    y_pred = clf.predict(X_test_naive)
    
    # scikit-learn metrics take the true labels first, then the predictions
    f1 = f1_score(y_test_naive, y_pred, average = 'weighted')
    pres = precision_score(y_test_naive, y_pred, average = 'weighted')
    rec = recall_score(y_test_naive, y_pred, average = 'weighted')
    acc = accuracy_score(y_test_naive, y_pred)
    
    res = res.append({'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres, 
                     'Recall': rec, 'F1-score': f1, 'Accuracy': acc}, ignore_index = True)

    return res

As you can see the order is:

define text_prepare for text cleaning;
define basic_preprocessing;
define BOW;
split the dataset into train and test;
apply the sampling;
apply the algorithm.

What I do not understand is how to encode the text correctly so that the algorithm works. My dataset is called df and its columns are:

Label      Text                                 Year
1         bla bla bla                           2000
0         add some words                        2012
1         this is just an example               1998
0         unfortunately the code does not work  2018
0         where should I apply the encoding?    2000
0         What am I missing here?               2005

The order when I apply BOW is wrong as I get this error: ValueError: could not convert string to float: 'Expect a good results if ... '

I followed the steps (and code) from this link: kaggle.com/ruzarx/oversampling-smote-and-adasyn . However, the sampling part there is wrong, as it should be applied only to the training set, i.e. after the split. The principle should be: (1) split into training/test; (2) apply resampling to the training set, so that the model is trained on balanced data; (3) apply the model to the test set and evaluate on it.
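The three-step principle above can be sketched on its own, independent of the text encoding. This is a minimal illustration under stated assumptions: the toy DataFrame is made up, and the names `minority`, `majority`, `majority_down` and `balanced_train` are hypothetical. It also samples without replacement (`replace=False`), which is the usual choice when undersampling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 4 positives, 16 negatives.
df = pd.DataFrame({
    "Text": [f"message {i}" for i in range(20)],
    "Label": [1] * 4 + [0] * 16,
})

# (1) Split first, so the test set keeps the original class balance.
train, test = train_test_split(
    df, test_size=0.25, stratify=df["Label"], random_state=40)

# (2) Undersample the majority class within the training set only.
minority = train[train["Label"] == 1]
majority = train[train["Label"] == 0]
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=40)
balanced_train = pd.concat([minority, majority_down])

# (3) A model would be fitted on balanced_train and evaluated on the untouched test set.
print(balanced_train["Label"].value_counts().to_dict())
print(test["Label"].value_counts().to_dict())
```

The test set is never resampled, so the evaluation reflects the class imbalance the model would meet in practice.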

I will be happy to provide further information, data and/or code, but I think I have provided all the most relevant steps.

Thanks a lot.

Can you provide a full traceback? Which line in the BOW function throws an error? — Lars Vagnes, Dec 10 '20 at 08:13
I would say the problem is in count_vectorizer = CountVectorizer(analyzer=fun), in the fun function. Is that possible? — Ezriel_S, Dec 10 '20 at 09:15
I do not think it is caused by fun, as it is just defined as follows: `def fun(text): remove_punc = [c for c in text if c not in string.punctuation] remove_punc = ''.join(remove_punc) cleaned = [w for w in remove_punc.split() if w.lower() not in stopwords.words('english')] return cleaned` — LdM, Dec 10 '20 at 11:36
From what I see, the problem seems to be in this line: `count_vectorizer.fit(df_temp['Text'])`. You are passing text data to the algorithm, but it only works with integer or float data. You should use some kind of encoding before passing it to the fit function; you can try `LabelEncoder` or `OneHotEncoder`. — Chandan, Dec 10 '20 at 18:45
I couldn't find the definition of `training_naive` in your question. Could you please add it? — Venkatachalam, Dec 11 '20 at 07:48
Also, it is always useful to simplify your question to the minimum code required to reproduce your problem. Otherwise you might lose potential answerers for your question. — Venkatachalam, Dec 11 '20 at 08:49
@Venkatachalam, I opened a new question on an error that I am getting from a similar function that I have used here. In case you want to have a look: https://stackoverflow.com/questions/65270921/dimension-mismatch-when-i-try-to-apply-tf-idf Thanks — LdM, Dec 13 '20 at 02:26

Venkatachalam · Accepted Answer · 2020-12-11T12:41:49.187

You need a separate test BOW function that reuses the count vectorizer fitted during the training phase.

Consider using a pipeline to reduce the verbosity of the code.

import pandas as pd
from sklearn.naive_bayes import MultinomialNB
import string
from nltk.corpus import stopwords
import re
from sklearn.model_selection import train_test_split
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.utils import resample
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

def fun(text):
    remove_punc = [c for c in text if c not in string.punctuation]
    remove_punc = ''.join(remove_punc)
    cleaned = [w for w in remove_punc.split() if w.lower()
               not in stopwords.words('english')]
    return cleaned
# Testing Count Vectorizer

def BOW(data):

    df_temp = data.copy(deep=True)
    df_temp = basic_preprocessing(df_temp)

    count_vectorizer = CountVectorizer(analyzer=fun)
    count_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = count_vectorizer.transform(list_corpus)

    return X, list_labels, count_vectorizer

def test_BOW(data, count_vectorizer):

    df_temp = data.copy(deep=True)
    df_temp = basic_preprocessing(df_temp)

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = count_vectorizer.transform(list_corpus)

    return X, list_labels

def basic_preprocessing(df):

    df_temp = df.copy(deep=True)
    df_temp = df_temp.rename(index=str, columns={'Clean_Titles_2': 'Text'})
    df_temp.loc[:, 'Text'] = [text_prepare(x) for x in df_temp['Text'].values]


    tokenizer = RegexpTokenizer(r'\w+')
    df_temp["Tokens"] = df_temp["Text"].apply(tokenizer.tokenize)

    return df_temp


def text_prepare(text):

    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
    STOPWORDS = set(stopwords.words('english'))

    text = text.lower()
    # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    # delete symbols which are in BAD_SYMBOLS_RE from text
    text = BAD_SYMBOLS_RE.sub('', text)
    # delete stopwords from text
    words = [w for w in text.split() if w not in STOPWORDS]
    text = ' '.join(words)

    return text

s = """Label      Text                                 Year
1         bla bla bla                           2000
0         add some words                        2012
1         this is just an example               1998
0         unfortunately the code does not work  2018
0         where should I apply the encoding?    2000
0         What am I missing here?               2005"""


# a regex separator requires the Python parsing engine
df = pd.read_csv(StringIO(s), sep=r'\s{2,}', engine='python')


X = df[['Text']]
y = df['Label']


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40)

# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1)
# Separating classes
spam = training_set[training_set.Label == 1]
not_spam = training_set[training_set.Label == 0]

# Undersampling the majority
undersample = resample(not_spam,
                       replace=True,
                       # set the number of samples to equal the number of the minority class
                       n_samples=len(spam),
                       random_state=40)
# Returning to new training set
undersample_train = pd.concat([spam, undersample])

full_result = pd.DataFrame(columns=['Preprocessing', 'Model', 'Precision',
                                    'Recall', 'F1-score', 'Accuracy'])
train_x, train_y, count_vectorizer  = BOW(undersample_train)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_BOW(testing_set, count_vectorizer)



def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):

    clf = MultinomialNB()  # Multinomial Naive Bayes
    clf.fit(X_train_naive, y_train_naive)

    y_pred = clf.predict(X_test_naive)

    # scikit-learn metrics take the true labels first, then the predictions
    f1 = f1_score(y_test_naive, y_pred, average='weighted')
    pres = precision_score(y_test_naive, y_pred, average='weighted')
    rec = recall_score(y_test_naive, y_pred, average='weighted')
    acc = accuracy_score(y_test_naive, y_pred)

    # DataFrame.append was removed in pandas 2.0, so build the one-row result directly
    res = pd.DataFrame([{'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres,
                         'Recall': rec, 'F1-score': f1, 'Accuracy': acc}])

    return res

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
full_result = pd.concat([full_result, training_naive(train_x, test_x, train_y, test_y, 'Count Vectorize')], ignore_index=True)
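The pipeline suggestion at the top of this answer could look roughly like the sketch below. It is a simplified illustration, not a drop-in replacement for the code above: the toy `train` and `test` DataFrames are made up, the step names `"bow"` and `"nb"` are arbitrary, and scikit-learn's built-in English stop-word list stands in for the custom NLTK-based cleaning.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the balanced training set and the test set.
train = pd.DataFrame({
    "Text": ["win a free prize", "free cash offer",
             "lunch at noon", "see you at the meeting"],
    "Label": [1, 1, 0, 0],
})
test = pd.DataFrame({
    "Text": ["free prize offer", "meeting at noon"],
    "Label": [1, 0],
})

# fit() fits the vectorizer and the classifier on the training text;
# predict() then reuses the already-fitted vectorizer on the test text.
model = Pipeline([
    ("bow", CountVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
model.fit(train["Text"], train["Label"])
y_pred = model.predict(test["Text"])
print(list(y_pred))
```

Because the vectorizer lives inside the pipeline, there is no separate test_BOW step to forget. The custom analyzer from the question could be plugged in with `CountVectorizer(analyzer=fun)` instead of `stop_words="english"`.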

Hi @Venkatachalam, please see the updated question, which now includes training_naive. I think the error comes from encoding only the training set and not the test set, so when I apply the algorithm to the test set I get an error: `ValueError: could not convert string to float:...` . Could you please check whether you have a problem applying training_naive? I would like to apply the encoding in a way that works and gives me a classification report (achievable only with test/pred). Thanks a lot. — LdM, Dec 11 '20 at 10:37
@LdM, I think you're still getting that error because you are trying to fit and test the Naïve Bayes model with the untransformed data. The data transformed by your approach using the BOW function `X, y = BOW(undersample_train)` isn't actually being used: `full_result = full_result.append(training_naive(X_train, X_test, y_train, y_test, 'Count Vectorize'), ignore_index = True)`. So in @Venkatachalam's answer, the Naïve Bayes model is being passed the transformed data, and that's why this now works. — user6386471, Dec 12 '20 at 21:47
And as @Venkatachalam pointed out, the correct use of the count vectorizer is to fit it to the training corpus (so that it picks up the vocabulary of the training data) and then transform both the training and test documents using the same count vectorizer model. The point of fitting on the training data and not the test data is to get a representative generalisation error of the classification model on real world data - if there is new vocabulary in unseen data, we'll want to get an understanding of how this affects the performance of the classification model. — user6386471, Dec 12 '20 at 21:55
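The fit-on-train, transform-both pattern described in the comment above can be seen in isolation with a small sketch. The two toy corpora are made up for illustration, and the default CountVectorizer tokenizer is used instead of the custom analyzer from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora (made up for illustration).
train_docs = ["free prize now", "meeting at noon", "free free offer"]
test_docs = ["free lunch at noon", "totally unseen words"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training text only.
X_train = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; nothing is refitted on the test text.
X_test = vectorizer.transform(test_docs)

# Both matrices have the same columns, so a model trained on X_train can score X_test.
print(X_train.shape, X_test.shape)

# Test words that are not in the training vocabulary are silently ignored,
# so the second test document becomes an all-zero row.
print(sorted(vectorizer.vocabulary_))
print(X_test.toarray())
```

Fitting a second vectorizer on the test documents would instead produce a different vocabulary and column layout, which a classifier trained on the first matrix cannot use.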
