
TfidfVectorizer provides an easy way to encode & transform texts into vectors.

My question is: how do I choose proper values for parameters such as min_df, max_features, smooth_idf, and sublinear_tf?

Update:

Maybe I should have given more detail in the question:

What if I am doing unsupervised clustering on a set of texts? I don't have any labels for the texts, and I don't know how many clusters there might be (which is actually what I am trying to figure out).

David Batista
user6396
    Look into "cross-validation". That decision process is called "hyperparameter tuning" because `min_df`, etc. are hyperparameters. – Arya McCarthy May 19 '17 at 10:01

1 Answer


If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.

You can do that easily in sklearn with the GridSearchCV and Pipeline objects:

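from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# stop_words, train_x and train_y are assumed to be defined elsewhere.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])
# Keys follow the '<step name>__<parameter name>' convention.
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3)
}

# Evaluate every parameter combination with 2-fold cross-validation.
grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)

print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)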
David Batista
  • Thank you for your detailed answer. Unfortunately, I am doing unsupervised clustering on a set of texts, and I don't even have any labels for possible clusters. What should I do? – user6396 May 20 '17 at 03:48
  • You can evaluate how good your clusters are (see, for instance, https://www.wikiwand.com/en/Cluster_analysis#/Evaluation_and_assessment) and check how the TfidfVectorizer parameters influence the results. – David Batista May 20 '17 at 07:44
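
For the unsupervised case raised in the comments, one possible way to follow that suggestion is to loop over a few TfidfVectorizer settings and cluster counts and score each clustering with an internal measure. The sketch below is only an illustration under several assumptions that are not in the thread: the `docs` corpus and the `search` helper are hypothetical, KMeans is just one candidate clusterer, and the silhouette score with cosine distance is just one of the internal measures the linked page describes. Because each vectorizer setting produces a different feature space, comparing scores across settings is a heuristic, so it is worth inspecting the top few candidates by hand.

```python
from itertools import product

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus; replace it with your own unlabeled texts.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "dogs and cats are common pets",
    "my dog chased the cat",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
    "the central bank raised interest rates",
    "shares and bonds rallied after the rate decision",
    "the team won the football match",
    "the striker scored two goals in the match",
    "the coach praised the team after the game",
    "fans cheered as the team scored",
]


def search(texts, min_dfs=(1, 2), sublinear_tfs=(False, True), ks=range(2, 6)):
    """Score every (min_df, sublinear_tf, k) combination; best score first."""
    results = []
    for min_df, sublinear_tf, k in product(min_dfs, sublinear_tfs, ks):
        vectorizer = TfidfVectorizer(
            min_df=min_df, sublinear_tf=sublinear_tf, stop_words="english"
        )
        X = vectorizer.fit_transform(texts)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        # The silhouette score is undefined when only one cluster is found.
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(X, labels, metric="cosine")
        params = {"min_df": min_df, "sublinear_tf": sublinear_tf, "k": k}
        results.append((score, params))
    return sorted(results, key=lambda item: item[0], reverse=True)


results = search(docs)
for score, params in results[:3]:
    print(round(score, 3), params)
```

The same loop can be extended with other parameters (for example max_features or ngram_range) or with a different internal measure, at the cost of a larger search.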