Questions tagged [tf-idf]

“Term-frequency ⨉ Inverse Document Frequency”, or “tf-idf”, measures how important a word is to a document in a collection or corpus.
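
As a rough illustration of the definition above, here is a minimal sketch in plain Python. It assumes one common variant (term count divided by document length for tf, natural-log `log(N / df)` for idf). Libraries such as scikit-learn or Lucene use smoothed or otherwise different formulas, and the names `tf_idf` and `docs` are made up for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, already lowercased and split on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(docs)
```

Under this variant a term that occurs in every document, such as "the" here, gets weight 0, while a term that appears in only one document, such as "mat", gets the largest idf factor.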

1326 questions
15 votes, 7 answers

Get cosine similarity between two documents in Lucene

I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or another distance?) between two documents in the index. For example, I am getting from a previously opened IndexReader ir the documents with…
maiky • 3,503 • 7 • 28 • 28
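
Independent of Lucene's API, the quantity asked about here can be sketched in plain Python. The sketch assumes each document is already available as a hypothetical `{term: weight}` mapping (for instance, tf-idf weights). The function name and sample weights are illustrative only, and this is not a description of how Lucene computes scores internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Made-up term weights for two documents.
doc_a = {"lucene": 0.5, "index": 0.2}
doc_b = {"lucene": 0.5, "score": 0.4}
similarity = cosine_similarity(doc_a, doc_b)
```

With non-negative weights the result lies between 0 (no shared terms) and 1 (same direction).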

15 votes, 1 answer

Trying to get tf-idf weighting working in R

I am trying to do some very basic text analysis with the tm package and get some tf-idf scores; I'm running OS X (though I've tried this on Debian Squeeze with the same result); I've got a directory (which is my working directory) with a couple text…
cforster • 577 • 2 • 7 • 19

14 votes, 3 answers

Append tfidf to pandas dataframe

I have the following pandas structure:

col1  col2  col3  text
1     1     0     meaningful text
5     9     7     trees
7     8     2     text

I'd like to vectorise it using a tfidf vectoriser. This, however, returns a sparse matrix, which I can actually turn…
lte__ • 7,175 • 25 • 74 • 131

14 votes, 4 answers

Train Model fails because 'list' object has no attribute 'lower'

I am training a classifier over tweets for sentiment analysis purposes. The code is the following:

    df = pd.read_csv('Trainded Dataset Sentiment.csv', error_bad_lines=False)
    df.head(5)
    #TWEET
    X = df[['SentimentText']].loc[2:50000]
    #SENTIMENT…
Alex • 1,447 • 7 • 23 • 48

14 votes, 2 answers

Adding New Text to Sklearn TFIDF Vectorizer (Python)

Is there a function to add to the existing corpus? I've already generated my matrix, and I'm looking to periodically add to the table without re-crunching the whole shebang, e.g.: articleList = ['here is some text blah blah','another text object', 'more…
Howard Zoopaloopa • 3,798 • 14 • 48 • 87

13 votes, 1 answer

Do I use the same Tfidf vocabulary in k-fold cross_validation

I am doing text classification based on the TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. But what confuses me is whether it is necessary to…
lx.F • 131 • 1 • 3

13 votes, 1 answer

Computing separate tfidf scores for two different columns using sklearn

I'm trying to compute the similarity between a set of queries and a set of results for each query. I would like to do this using tfidf scores and cosine similarity. The issue that I'm having is that I can't figure out how to generate a tfidf matrix…
David • 1,398 • 1 • 14 • 20

13 votes, 2 answers

What exactly does 'use_idf' do when creating a TfidfTransformer in sklearn?

I am using the TfidfTransformer from the sklearn package in Python 2.7. As I was getting comfortable with the arguments, I became a bit confused about use_idf, as in: TfidfVectorizer(use_idf=False).fit_transform() What exactly…
Monica Heddneck • 2,973 • 10 • 55 • 89

13 votes, 1 answer

Elasticsearch score disable IDF

I'm using ES for searching a huge list of human names employing fuzzy search techniques. TF is applicable for scoring, but IDF is really not required for me in this case. This is really diluting the score. I still want TF and Field Norm to be…
user1189332 • 1,773 • 4 • 26 • 46

13 votes, 1 answer

Elasticsearch word frequency and relations

I am wondering if it is possible at all to get the top ten most frequent words in an Elasticsearch field across an entire index or alias. Here is what I'm trying to do: I am indexing text documents extracted from various document types (Word,…
Zaid Amir • 4,727 • 6 • 52 • 101

13 votes, 4 answers

How do I visualize data points of tf-idf vectors for kmeans clustering?

I have a list of documents and the tf-idf score for each unique word in the entire corpus. How do I visualize that on a 2-d plot to give me a gauge of how many clusters I will need to run k-means? Here is my code: sentence_list=["Hi how are you",…
jxn • 7,685 • 28 • 90 • 172

12 votes, 1 answer

Confused with the return result of TfidfVectorizer.fit_transform

I wanted to learn more about NLP. I came across this piece of code. But I was confused about the outcome of TfidfVectorizer.fit_transform when the result is printed. I am familiar with what tfidf is but I could not understand what the numbers…
Huzo • 1,652 • 1 • 21 • 52

12 votes, 1 answer

pyspark: sparse vectors to scipy sparse matrix

I have a spark dataframe with a column of short sentences, and a column with a categorical variable. I'd like to perform tf-idf on the sentences, one-hot-encoding on the categorical variable and then output it to a sparse matrix on my driver once…
Luke • 6,699 • 13 • 50 • 88

12 votes, 2 answers

How to select stop words using tf-idf? (non-English corpus)

I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that…

11 votes, 1 answer

When to use which base of log for tf-idf?

I'm working on a simple search engine where I use the TF-IDF formula to score how important a search word is. I see people using different bases for the formula, but I see no explanation for when to use which. Does it matter at all, and do you have…
Djaff • 173 • 1 • 10
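
On the question of the base: since log_b(x) = ln(x) / ln(b), switching bases multiplies every idf value by the same constant, so the relative order of scores does not change as long as one base is used consistently. A small sketch with made-up corpus statistics (the `idf` helper and all numbers are assumptions for illustration):

```python
import math

def idf(n_docs, doc_freq, base):
    """Inverse document frequency with a configurable logarithm base."""
    return math.log(n_docs / doc_freq, base)

# Hypothetical corpus statistics: 1000 documents, document frequency per term.
n_docs = 1000
doc_freqs = {"the": 990, "search": 120, "lucene": 4}

# idf of every term, computed once per base.
idf_by_base = {
    base: {term: idf(n_docs, df, base) for term, df in doc_freqs.items()}
    for base in (2, math.e, 10)
}

# Ordering terms by idf gives the same ranking for every base.
rankings = {
    base: sorted(scores, key=scores.get, reverse=True)
    for base, scores in idf_by_base.items()
}
```

The absolute values differ by a constant factor (for example, base-2 values are `math.log(10, 2)` times the base-10 values), which only matters if scores are compared against a fixed threshold or mixed with scores computed using another base.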