Questions tagged [tf-idf]

In natural language processing (NLP) and text mining, “Term-frequency ⨉ Inverse Document Frequency”, or “tf-idf”, measures how important a word is to a document in a collection or corpus.

References:

Tf idf - Wikipedia

1326 questions

3 answers

How do I store a TfidfVectorizer for future use in scikit-learn?

I have a TfidfVectorizer that vectorizes a collection of articles, followed by feature selection. vectroizer = TfidfVectorizer() X_train = vectroizer.fit_transform(corpus) selector = SelectKBest(chi2, k = 5000 ) X_train_sel =…

python python-3.x scikit-learn tf-idf joblib

asked Sep 24 '15 at 15:14

user2161903

4 answers

Python TfidfVectorizer throwing: "empty vocabulary; perhaps the documents only contain stop words"

I'm trying to use Python's Tfidf to transform a corpus of text. However, when I try to fit_transform it, I get the error ValueError: empty vocabulary; perhaps the documents only contain stop words. In [69]:…

python pandas scikit-learn tf-idf

asked Jan 05 '14 at 01:00

Max Song

4 answers

User Warning: Your stop_words may be inconsistent with your preprocessing

I am following this document clustering tutorial. As input I use a txt file which can be downloaded here. It combines 3 other txt files, separated by \n. After creating a tf-idf matrix I received this warning: "UserWarning:…

vectorization text-processing tf-idf stop-words stemming

asked Aug 03 '19 at 16:23

Karolina Andruszkiewicz

3 answers

Computing TF-IDF on the whole dataset or only on training data?

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author uses scikit-learn's fit_transform function during data pre-processing to get the tfidf features of the text for training. The author gives all text data to the function…

python machine-learning scikit-learn nlp tf-idf

asked Dec 12 '17 at 17:34

keramat

3 answers

How are TF-IDF values calculated by the scikit-learn TfidfVectorizer

I run the following code to convert the text matrix to a TF-IDF matrix. text = ['This is a string','This is another string','TFIDF computation calculation','TfIDF is the product of TF and IDF'] from sklearn.feature_extraction.text import…

nlp scikit-learn tf-idf

asked May 01 '16 at 11:16

prashanth

1 answer

TF*IDF for Search Queries

Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature Basically, I want to create a search query that searches through multiple documents. I would like…

python nlp nltk scikit-learn tf-idf

asked Aug 11 '12 at 02:44

tabchas

2 answers

What does a weighted word embedding mean?

In the paper that I am trying to implement, it says, In this work, tweets were modeled using three types of text representation. The first one is a bag-of-words model weighted by tf-idf (term frequency - inverse document frequency) (Section …

machine-learning nlp word2vec tf-idf word-embedding

asked Dec 09 '17 at 09:16

Dawn17

1 answer

How to make TF-IDF matrix dense?

I am using TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features, which I then plan to input into a k-means algorithm (which I will implement). In that algorithm I will have to compute distances between centroids…

python scikit-learn cluster-analysis sparse-matrix tf-idf

asked Jan 31 '16 at 01:44

gsamaras

1 answer

How to get word details from TF Vector RDD in Spark ML Lib?

I have computed term frequencies using HashingTF in Spark, calling tf.transform to get the term frequency for each word. But the results are shown in this format. [,…

apache-spark apache-spark-mllib tf-idf apache-spark-ml

asked Aug 29 '15 at 11:46

Srini

3 answers

TF-IDF implementations in Python

What are the standard tf-idf implementations/APIs available in Python? I've come across the one in nltk. I want to know the other libraries that provide this feature.

python nltk information-retrieval tf-idf

asked Nov 22 '13 at 08:56

scarecrow

5 answers

SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors

I have already pre-cleaned the data, and below shows the format of the top 4 rows: [IN] df.head() [OUT] Year cleaned 0 1909 acquaint hous receiv follow letter clerk crown... 1 1909 ask secretari state war…

scikit-learn knn tf-idf oversampling imblearn

asked Mar 20 '18 at 23:48

Dbercules

3 answers

How do I normalise a Solr/Lucene score?

I am trying to work out how to improve the scoring of Solr search results. My application needs to take the score from the Solr results and display a number of “stars” depending on how well the result(s) match the query. 5 Stars = almost/exact…

search lucene solr normalization tf-idf

asked Oct 21 '10 at 09:53

Grant Collins

3 answers

How to use tf-idf with Naive Bayes?

While searching for an answer to the question I am posting here, I found many links that propose a solution but don't say exactly how it is to be done. I have explored, for example, the following links: Link 1 Link 2 Link 3 Link 4 etc.…

python-2.7 tf-idf naivebayes

asked May 24 '16 at 06:07

POOJA GUPTA

3 answers

How do I calculate TF-IDF of a query?

How do I calculate tf-idf for a query? I understand how to calculate tf-idf for a set of documents with the following definitions: tf = occurrences in document / total words in document; idf = log(#documents / #documents where term occurs). But I don't…

search computer-science tf-idf data-retrieval

asked May 09 '16 at 00:13

Codarus
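
The definitions quoted in the question above can be sketched in plain Python. This is only one common convention, not necessarily what the asker or any particular library does: the query is treated as a short document, its tf is computed from the query's own tokens, and the idf table comes from the document collection. The function names, the toy corpus, and the choice to give unseen query terms a weight of 0.0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """tf = occurrences of the term / total words, as defined in the question."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequencies(documents):
    """idf = log(#documents / #documents where the term occurs)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Weight tokens with a fixed idf table; unseen terms get 0.0 (assumption)."""
    tf = term_frequencies(tokens)
    return {term: value * idf.get(term, 0.0) for term, value in tf.items()}


# Hypothetical toy corpus, already tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
idf = inverse_document_frequencies(corpus)

# The query is weighted like a very short document, reusing the corpus idf.
query_vector = tfidf_vector("cat mat".split(), idf)
```

Reusing the corpus idf keeps the query vector in the same weighting scheme as the document vectors, so the two can be compared term by term.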

3 answers

Cosine Similarity of Vectors of different lengths?

I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the Cosine Similarity between two of these documents I get a traceback saying: #len(u)==201,…

python nlp similarity nltk tf-idf

asked Jun 25 '10 at 20:27

erikcw
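
A length mismatch like the one in the excerpt above typically arises when each document's vector is built over its own vocabulary. The asker's code is not shown, so the following is only a sketch under that assumption: vectors are stored as {term: weight} dicts (the weights here are made-up numbers), and the helper names are hypothetical. Either compare the sparse dicts directly, or expand both onto one shared vocabulary so the dense vectors have equal length.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def to_dense(vector, vocabulary):
    """Expand a sparse vector to a list aligned with a shared, ordered vocabulary."""
    return [vector.get(term, 0.0) for term in vocabulary]


# Made-up tf-idf weights for two documents with different term sets.
doc_a = {"tf": 0.5, "idf": 0.5, "vector": 0.2}
doc_b = {"idf": 0.4, "cosine": 0.9}

# A shared vocabulary gives both dense vectors the same length.
vocabulary = sorted(set(doc_a) | set(doc_b))
dense_a = to_dense(doc_a, vocabulary)
dense_b = to_dense(doc_b, vocabulary)

similarity = cosine_similarity(doc_a, doc_b)
```

Terms missing from a document are simply zeros in the dense form, which is why the sparse and dense computations give the same similarity.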
