Questions tagged [tf-idf]

“Term-frequency ⨉ Inverse Document Frequency”, or “tf-idf”, in Natural Language Processing (NLP), measures how important a word is to a document in a collection or corpus.

1326 questions
11
votes
1 answer

How does TfidfVectorizer compute scores on test data

In scikit-learn, TfidfVectorizer allows us to fit on training data and later use the same vectorizer to transform our test data. The output of the transformation over the train data is a matrix that represents a tf-idf score for each word for…
Yuval Cohen
  • 131
  • 1
  • 5
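
As a rough illustration of the fit/transform split this question asks about, here is a minimal from-scratch sketch. It is not scikit-learn's code: it assumes whitespace tokenization, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and L2 normalization, which correspond to TfidfVectorizer's documented defaults. The function names `fit_idf` and `transform` and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))  # count each term once per document
    # Smoothed IDF (assumed here): ln((1 + n) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def transform(doc, idf):
    """Score one document with previously learned IDF weights, then L2-normalize."""
    # Terms that were never seen during fitting have no IDF and are dropped.
    counts = Counter(t for t in doc.lower().split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train = ["the cat sat", "the dog barked", "the cat purred"]
idf = fit_idf(train)                      # IDF comes from the training data only
test_vector = transform("the cat meowed", idf)
print(test_vector)
```

The point of the sketch is that the test document contributes only its term counts; the IDF weights and the vocabulary are fixed at fit time, so "meowed" is ignored.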
11
votes
1 answer

Effects of Stemming on the term frequency?

How are the term frequencies (TF), and inverse document frequency (IDF), affected by stop-word removal and stemming? Thanks!
Ataman
  • 2,530
  • 3
  • 22
  • 34
10
votes
4 answers

Using sklearn how do I calculate the tf-idf cosine similarity between documents and a query?

My goal is to input 3 queries and find out which query is most similar to a set of 5 documents. So far I have calculated the tf-idf of the documents doing the following: from sklearn.feature_extraction.text import TfidfVectorizer def…
OultimoCoder
  • 244
  • 2
  • 7
  • 24
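
One way to picture the query-versus-documents comparison is the sketch below. It is a plain-Python approximation under stated assumptions (whitespace tokens, smoothed IDF learned from the documents, cosine computed on sparse dicts), not the asker's code and not sklearn's implementation. The helper names and the example texts are made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def idf_from(docs):
    """Smoothed IDF learned from the document collection."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(text, idf):
    """Sparse tf-idf vector (term -> weight); out-of-vocabulary terms are skipped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

documents = ["fresh milk and biscuits", "chocolate biscuit pudding", "strong black coffee"]
idf = idf_from(documents)
doc_vectors = [vectorize(d, idf) for d in documents]

query = "biscuit pudding recipe"
query_vector = vectorize(query, idf)      # the query reuses the documents' IDF
scores = [cosine(query_vector, dv) for dv in doc_vectors]
best = max(range(len(documents)), key=scores.__getitem__)
print(best, scores)
```

The query is projected into the same term space as the documents, so similarity only reflects terms the collection already knows about.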
10
votes
3 answers

Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below. import numpy as np import pandas as pd from…
ongenz
  • 890
  • 1
  • 10
  • 20
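
The "keep only the top n terms per document" idea can be sketched without any particular matrix library. The version below assumes each row is available as a term-to-score mapping; the scores are invented illustrative values, and `keep_top_n` is a hypothetical helper rather than anything from the question.

```python
import heapq

def keep_top_n(row, n):
    """Return a copy of a sparse row (term -> score) with only its n highest-scoring terms."""
    top = heapq.nlargest(n, row.items(), key=lambda item: item[1])
    return dict(top)

# Illustrative tf-idf rows; the numbers are made up.
rows = [
    {"movie": 0.61, "good": 0.48, "the": 0.12, "plot": 0.55},
    {"bad": 0.70, "movie": 0.40, "very": 0.20},
]
pruned = [keep_top_n(row, 2) for row in rows]
print(pruned)
```

With a real sparse matrix the same selection would be done row by row on the stored non-zero entries; the dict form is used here only to keep the sketch self-contained.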
10
votes
2 answers

Calculate TF-IDF using sklearn for n-grams in python

I have a vocabulary list that includes n-grams as follows. myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding'] I want to use these words to calculate TF-IDF values. I also have a dictionary of corpus as follows (key =…
user8566323
10
votes
1 answer

how to choose parameters in TfidfVectorizer in sklearn during unsupervised clustering

TfidfVectorizer provides an easy way to encode & transform texts into vectors. My question is how to choose the proper values for parameters such as min_df, max_features, smooth_idf, sublinear_tf? update: Maybe I should have put more details on the…
user6396
  • 1,832
  • 6
  • 23
  • 38
10
votes
2 answers

max_df corresponds to < documents than min_df error in Ridge classifier

I trained the ridge classifier with a huge amount of data, used a tfidf vectorizer to vectorize the data, and it used to work fine. But now I am facing an error 'max_df corresponds to < documents than min_df'. The data is stored in MongoDB. I tried…
athi_nn
  • 101
  • 1
  • 1
  • 6
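
A possible reading of this error message, sketched under assumptions: an integer `min_df` or `max_df` is treated as an absolute document count and a float as a fraction of the corpus, and the two resulting counts are then compared. The function names below are hypothetical and this is not the library's source; it only shows how a small batch of documents could make the bounds cross.

```python
def doc_count_bounds(min_df, max_df, n_docs):
    """Convert min_df / max_df to document counts.

    Assumption: ints are absolute counts, floats are fractions of n_docs.
    """
    lo = min_df if isinstance(min_df, int) else min_df * n_docs
    hi = max_df if isinstance(max_df, int) else max_df * n_docs
    return lo, hi

def is_consistent(min_df, max_df, n_docs):
    """True when the upper bound is not below the lower bound."""
    lo, hi = doc_count_bounds(min_df, max_df, n_docs)
    return hi >= lo

large_ok = is_consistent(min_df=5, max_df=0.8, n_docs=1000)  # 800 >= 5
small_ok = is_consistent(min_df=5, max_df=0.8, n_docs=4)     # 3.2 < 5
print(large_ok, small_ok)
```

If this reading is right, the same settings can work on a large corpus and fail once the number of documents fetched becomes very small, so checking how many documents actually reach the vectorizer is a reasonable first step.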
10
votes
1 answer

tf-idf documents of different length

I have searched the web about normalizing tf grades in cases where the documents' lengths are very different (for example, document lengths varying from 500 words to 2500 words). The only normalization I've found talks about dividing the term…
Shahaf Stein
  • 165
  • 2
  • 14
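
Two commonly used ways to reduce the effect of document length are L2 normalization of the weight vector and sublinear term-frequency scaling (1 + ln(tf)); in scikit-learn these correspond to the `norm="l2"` and `sublinear_tf=True` options. The sketch below is a stdlib-only illustration with a hypothetical helper name and toy documents, not a recommendation for the asker's specific data.

```python
import math
from collections import Counter

def l2_normalised_tf(tokens, sublinear=False):
    """Term weights scaled to unit Euclidean length, optionally with 1 + ln(tf)."""
    counts = Counter(tokens)
    if sublinear:
        weights = {t: 1.0 + math.log(c) for t, c in counts.items()}
    else:
        weights = {t: float(c) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

short_doc = "tf idf weights terms".split()
long_doc = short_doc * 5                  # same content, five times longer

a = l2_normalised_tf(short_doc)
b = l2_normalised_tf(long_doc)
same = all(abs(a[t] - b[t]) < 1e-12 for t in a)
print(same)
```

Because both vectors are rescaled to unit length, a document and a longer copy of itself end up with the same weights, which is the property length normalization is after.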
10
votes
1 answer

Getting TF-IDF Scores Of Words Using Gensim

I am trying to find the most important words in a corpus based on their TF-IDF scores. Been following along the example at https://radimrehurek.com/gensim/tut2.html. Based on >>> for doc in corpus_tfidf: ... print(doc) the TF-IDF score is…
user799188
  • 13,965
  • 5
  • 35
  • 37
10
votes
2 answers

Pickle Tfidfvectorizer along with a custom tokenizer

I'm using a custom tokenizer to pass to TfidfVectorizer. That tokenizer depends on an external class TermExtractor, which is in another file. I basically want to build a TfidfVectorizer based on certain terms, and not all single words/tokens. Here…
David Batista
  • 3,029
  • 2
  • 23
  • 42
9
votes
2 answers

Accuracy with TF-IDF and non-TF-IDF features

I run a Random Forest algorithm with TF-IDF and non-TF-IDF features. In total the features are around 130k in number (after a feature selection conducted on the TF-IDF features) and the observations of the training set are around 120k in…
Outcast
  • 4,967
  • 5
  • 44
  • 99
9
votes
2 answers

TF-IDF Find Cosine Similarity Between New Document and Dataset

I have a TF-IDF matrix of a dataset of products: tfidf = TfidfVectorizer().fit_transform(words) where words is a list of descriptions. This produces a 69258x22024 matrix. Now I want to find cosine similarities between a new product and the ones in…
Mohamed Oun
  • 561
  • 1
  • 9
  • 24
9
votes
3 answers

Document similarity: Vector embedding versus Tf-Idf performance?

I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches: A vector embedding (word2vec, GloVe or fasttext), averaging over word…
Alec Matusis
  • 781
  • 1
  • 7
  • 16
9
votes
2 answers

data frame of tfidf with Python

I have to classify some sentiments. My data frame looks like this:

Phrase                  Sentiment
is it good movie        positive
wooow is it very goode  positive
bad movie               negative

I did some preprocessing as…
Amal Kostali Targhi
  • 907
  • 3
  • 11
  • 22
9
votes
2 answers

How to classify new documents with tf-idf?

If I use the TfidfVectorizer from sklearn to generate feature vectors as: features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments) How would I then generate feature vectors to classify a new document? Since you can't…
Isbister
  • 906
  • 1
  • 12
  • 30