I am trying to do text classification with TPOT. I know you can save the vocabulary of the TfidfVectorizer, but I am having some issues getting predictions out of my model.
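(By saving the vocabulary I mean persisting the fitted vectorizer so the same vocabulary and idf weights can be reused at inference time; the sketch below is just what I am assuming, with joblib as an arbitrary choice on my part.)

import joblib

# Persist the fitted vectorizer (vocabulary + idf weights) after training...
joblib.dump(tfidfconverter, "tfidf_vectorizer.joblib")
# ...and reload it later so new text is transformed with the same vocabulary
tfidfconverter = joblib.load("tfidf_vectorizer.joblib")

Here is my current code: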
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold
from tpot import TPOTClassifier

x_sentences = ["hello world", "how are you", ...]
y_classes = [1, 2, ...]

# Fit the vectorizer on the raw sentences and convert the TF-IDF matrix to a dense array
tfidfconverter = TfidfVectorizer(max_features=500, min_df=5, max_df=0.7)
X = tfidfconverter.fit_transform(x_sentences).toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y_classes, test_size=0.2, random_state=42)

# Evolve a pipeline with TPOT using repeated stratified k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = TPOTClassifier(generations=3, population_size=30, cv=cv, scoring='accuracy',
                       verbosity=2, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)

# Return prediction for a new sample, reusing the fitted vectorizer
sample = tfidfconverter.transform(["hello friend"])
print(model.predict(sample))
I want my model to accept words that are not in the original dataset. I am not sure whether I have to pad the sentences or how to make it generalize to unseen input; I think reusing the same tfidfconverter should be good enough (a small check of what it returns is included after the error message below). When I run inference on a new sample, it raises the following error:
ValueError: Not all operators in None supports sparse matrix. Please use "TPOT sparse" for sparse matrix.
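If it matters, my understanding (and this is only an assumption on my part) is that transform() returns a scipy sparse matrix, whereas the data the model was fitted on had been converted to a dense array with .toarray(). This is the small check I mean, using the same objects as above:

# Quick check of the input types, reusing the objects defined above
print(type(X))       # numpy.ndarray -- what model.fit() saw (via train_test_split)
print(type(sample))  # scipy sparse CSR matrix -- what model.predict() receives

Should I also call .toarray() on the new sample, or is the "TPOT sparse" configuration mentioned in the error the right way to handle this?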