I'm new to this topic and experimenting with categorising product names. Even without deeper knowledge, a basic MultinomialNB setup already gives quite good results for my use case.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
df = pd.DataFrame({
'title':['short shirt', 'long shirt','green shoe','cool sneaker','heavy ballerinas'],
'label':['shirt','shirt','shoe','shoe','shoe']
})
count_vec = CountVectorizer()
bow = count_vec.fit_transform(df['title'])
bow = np.array(bow.todense())
X = bow
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
model = MultinomialNB().fit(X_train, y_train)
model.predict(X_test)
Based on the model trained in the simplified example above, I would like to categorise completely new titles and output them together with their predicted labels:
new = pd.DataFrame({
'title':['long top', 'super shirt','white shoe','super cool sneaker','perfect fit ballerinas'],
'label': np.nan
})
Unfortunately, I am not sure about the next steps and would appreciate some help. This is what I have tried so far:
...
count_vec = CountVectorizer()
bow = count_vec.fit_transform(new['title'])
bow = np.array(bow.todense())
model.predict(bow)
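The attempt above fits a second CountVectorizer on the new titles, so its columns come from a different vocabulary than the one the model was trained on. One possible direction is to keep the vectorizer that was fitted on the training titles and call only `transform` on the new ones; words the vectorizer has never seen (such as "super" or "top") are then ignored. The following is a minimal sketch under that assumption. It trains on all five example rows instead of a train/test split, purely to keep the example short and deterministic, and the name `X_new` is introduced here for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.DataFrame({
    'title': ['short shirt', 'long shirt', 'green shoe', 'cool sneaker', 'heavy ballerinas'],
    'label': ['shirt', 'shirt', 'shoe', 'shoe', 'shoe'],
})
new = pd.DataFrame({
    'title': ['long top', 'super shirt', 'white shoe', 'super cool sneaker', 'perfect fit ballerinas'],
})

# Fit the vocabulary once, on the training titles only.
count_vec = CountVectorizer()
X_train = count_vec.fit_transform(df['title'])

# MultinomialNB accepts the sparse matrix directly, so no todense() is needed.
model = MultinomialNB().fit(X_train, df['label'])

# Reuse the SAME fitted vectorizer: transform(), not fit_transform().
# The resulting columns line up with the ones the model was trained on.
X_new = count_vec.transform(new['title'])

# Write the predictions back next to the titles.
new['label'] = model.predict(X_new)
print(new)
```

If the vectorizer and the model always travel together, wrapping both in `sklearn.pipeline.make_pipeline` would be another way to avoid fitting the vectorizer twice by accident.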