Questions tagged [tf-idf]

“Term-frequency ⨉ Inverse Document Frequency”, or “tf-idf”, is a weighting used in Natural Language Processing that measures how important a word is to a document in a collection or corpus.
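
As a rough illustration of the definition above, here is a minimal sketch in plain Python. It assumes the simple textbook variant (tf = occurrences in document / total words in document, idf = log(#documents / #documents containing the term)) and a naive whitespace tokenizer; the function name `tf_idf` and the sample corpus are hypothetical. Libraries often use smoothed or normalized variants, so their numbers will generally differ from these.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    Assumes the plain textbook variant:
      tf(t, d) = count of t in d / number of tokens in d
      idf(t)   = log(N / number of documents containing t)
    """
    docs = [doc.lower().split() for doc in corpus]  # naive whitespace tokenizer
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only.
corpus = [
    "this is a string",
    "this is another string",
    "tfidf is the product of tf and idf",
]
scores = tf_idf(corpus)
```

Under this variant, a word that occurs in every document (here "is") gets weight zero, while a word confined to one document (here "another") gets the highest weight within that document.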

1326 questions
22
votes
3 answers

How do I store a TfidfVectorizer for future use in scikit-learn?

I have a TfidfVectorizer that vectorizes a collection of articles, followed by feature selection. vectroizer = TfidfVectorizer() X_train = vectroizer.fit_transform(corpus) selector = SelectKBest(chi2, k = 5000 ) X_train_sel =…
user2161903
  • 577
  • 1
  • 6
  • 22
21
votes
4 answers

Python TfidfVectorizer throwing: "empty vocabulary; perhaps the documents only contain stop words"

I'm trying to use Python's Tfidf to transform a corpus of text. However, when I try to fit_transform it, I get a value error ValueError: empty vocabulary; perhaps the documents only contain stop words. In [69]:…
Max Song
  • 1,607
  • 2
  • 18
  • 26
20
votes
4 answers

User Warning: Your stop_words may be inconsistent with your preprocessing

I am following this document clustering tutorial. As input I give a txt file which can be downloaded here. It's a combined file of 3 other txt files separated by \n. After creating a tf-idf matrix I received this warning: "UserWarning:…
20
votes
3 answers

Computing TF-IDF on the whole dataset or only on training data?

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author, when pre-processing the data, uses the fit_transform function of scikit-learn to get the tfidf features of the text for training. The author gives all text data to the function…
keramat
  • 4,328
  • 6
  • 25
  • 38
19
votes
3 answers

How are TF-IDF calculated by the scikit-learn TfidfVectorizer

I run the following code to convert the text matrix to TF-IDF matrix. text = ['This is a string','This is another string','TFIDF computation calculation','TfIDF is the product of TF and IDF'] from sklearn.feature_extraction.text import…
prashanth
  • 4,197
  • 4
  • 25
  • 42
18
votes
1 answer

TF*IDF for Search Queries

Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature Basically, I want to create a search query that contains searches through multiple documents. I would like…
tabchas
  • 1,374
  • 2
  • 18
  • 37
17
votes
2 answers

What does a weighted word embedding mean?

In the paper that I am trying to implement, it says, In this work, tweets were modeled using three types of text representation. The first one is a bag-of-words model weighted by tf-idf (term frequency - inverse document frequency) (Section …
Dawn17
  • 7,825
  • 16
  • 57
  • 118
17
votes
1 answer

How to make TF-IDF matrix dense?

I am using TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features, which I then plan to input into a k-means algorithm (which I will implement). In that algorithm I will have to compute distances between centroids…
gsamaras
  • 71,951
  • 46
  • 188
  • 305
17
votes
1 answer

How to get word details from TF Vector RDD in Spark ML Lib?

I have created Term Frequency using HashingTF in Spark. I have got the term frequencies using tf.transform for each word. But the results are showing in this format. [,
Srini
  • 3,334
  • 6
  • 29
  • 64
17
votes
3 answers

TF-IDF implementations in python

What are the standard tf-idf implementations/api available in python? I've come across the one in nltk. I want to know the other libraries that provide this feature.
scarecrow
  • 6,624
  • 5
  • 20
  • 39
16
votes
5 answers

SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors

I have already pre-cleaned the data, and below shows the format of the top 4 rows: [IN] df.head() [OUT] Year cleaned 0 1909 acquaint hous receiv follow letter clerk crown... 1 1909 ask secretari state war…
Dbercules
  • 629
  • 1
  • 9
  • 26
16
votes
3 answers

how do I normalise a solr/lucene score?

I am trying to work out how to improve the scoring of solr search results. My application needs to take the score from the solr results and display a number of “stars” depending on how good the result(s) are to the query. 5 Stars = almost/exact…
Grant Collins
  • 1,781
  • 5
  • 31
  • 47
16
votes
3 answers

how to use tf-idf with Naive Bayes?

As per my search regarding the query, that I am posting here, I have got many links which propose solution but haven't mentioned exactly how this is to be done. I have explored, for example, the following links : Link 1 Link 2 Link 3 Link 4 etc.…
POOJA GUPTA
  • 2,295
  • 7
  • 32
  • 60
16
votes
3 answers

How do I calculate TF-IDF of a query?

How do I calculate tf-idf for a query? I understand how to calculate tf-idf for a set of documents with the following definitions: tf = occurrences in document / total words in document idf = log(#documents / #documents where term occurs) But I don't…
Codarus
  • 437
  • 1
  • 5
  • 16
15
votes
3 answers

Cosine Similarity of Vectors of different lengths?

I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the Cosine Similarity between two of these documents I get a traceback saying: #len(u)==201,…
erikcw
  • 10,787
  • 15
  • 58
  • 75