I am trying to do text classification with TPOT. I know you can save the vocabulary of a TfidfVectorizer, but I am having trouble getting predictions from my model.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RepeatedStratifiedKFold, train_test_split
from tpot import TPOTClassifier

x_sentences = ["hello world", "how are you", ...]
y_classes = [1, 2, ...]

tfidfconverter = TfidfVectorizer(max_features=500, min_df=5, max_df=0.7)
X = tfidfconverter.fit_transform(x_sentences).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y_classes, test_size=0.2, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = TPOTClassifier(generations=3, population_size=30, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)

# Return prediction for new sample
sample = tfidfconverter.transform(["hello friend"])
print(model.predict(sample))

I want my model to accept words that are not in the original dataset. I am not sure whether I have to pad the sentences, or how else I can make it generalize to unseen words. I think reusing the same tfidfconverter should be good enough. When I run inference on a new sample, it raises the following error:

ValueError: Not all operators in None supports sparse matrix. Please use "TPOT sparse" for sparse matrix.
Juanvulcano

1 Answer

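It is exactly as the error states: tfidfconverter.transform(...) returns a sparse matrix, and the default TPOT configuration (config_dict=None, the "None" in the message) does not accept sparse input. Pass the extra parameter config_dict='TPOT sparse' when you construct your TPOTClassifier. You can read more about it under 'Built-in TPOT Configurations' at http://epistasislab.github.io/tpot/using/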
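
If you would rather keep the default configuration, another option is to make the new sample dense before predicting, mirroring the .toarray() call the question already applies to the training data. Below is a minimal sketch of that idea. The helper name as_dense is hypothetical (it is not part of TPOT or scikit-learn), and it only assumes that SciPy sparse matrices expose a toarray() method while dense arrays pass through untouched.

```python
def as_dense(features):
    """Return a dense version of `features` if it is sparse, else return it unchanged.

    Hypothetical helper for illustration: it duck-types on `toarray()`,
    which SciPy sparse matrices provide.
    """
    # Sparse matrices (e.g. the output of TfidfVectorizer.transform) have .toarray()
    if hasattr(features, "toarray"):
        return features.toarray()
    # Anything else is assumed to be dense already
    return features
```

With the names from the question, the failing call would then read model.predict(as_dense(tfidfconverter.transform(["hello friend"]))). Words the vectorizer did not see during fitting are simply ignored by transform, so no padding should be needed.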

nut_job