Questions tagged [tf-idf]

“Term-frequency ⨉ Inverse Document Frequency”, or “tf-idf”, measures how important a word is to a document in a collection or corpus.

1326 questions
0
votes
1 answer

Dealing with homographs when counting n-grams in scikit-learn

I'm using TfidfVectorizer to count n-grams in the text, but I need to lemmatize it first. One written form can correspond to different lemmas, so all of them should be counted. How can I deal with this within the scikit-learn context? Do I need to write…
lizarisk
  • 7,562
  • 10
  • 46
  • 70
0
votes
0 answers

TF-IDF for my documents yields 0

I got this tfidf from yebrahim, and somehow my output documents yield 0 for every result. Is there a problem with this? An example of the output: hippo 0.0 hipper 0.0 hip 0.0 hint 0.0 hindsight 0.0 hill 0.0 hilarious 0.0 Thanks for the…
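
The excerpt above does not show the code, so the cause is unknown. One common reason for all-zero scores is worth ruling out, though: with the usual idf = log(N / df), the idf is exactly 0 whenever a term occurs in every document, which is always the case if the corpus holds a single document. A minimal sketch, where the function name `idf` and the counts are made up for illustration:

```python
import math

def idf(total_document_count, documents_with_term):
    """Plain inverse document frequency: log(N / df), natural log."""
    return math.log(total_document_count / documents_with_term)

# A corpus of one document: every term has df == N == 1, so idf is log(1).
single_document = idf(1, 1)

# A term that appears in all 4 documents of a 4-document corpus.
term_in_every_document = idf(4, 4)

# A term that appears in only 1 of 4 documents gets a positive weight.
rare_term = idf(4, 1)

print(single_document, term_in_every_document, rare_term)
```

If every score is 0.0, checking how many documents are actually loaded into the corpus is a cheap first step.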
0
votes
2 answers

getTermFreqVector() - NullPointerException

I am trying to get the tf of a set of documents using the following code: IndexReader r = IndexReader.open(FSDirectory.open(new File("index"))); TermFreqVector tfv = r.getTermFreqVector(root[i],"contents"); // where root[] contains the document IDs…
Faux Pas
  • 536
  • 1
  • 8
  • 20
0
votes
1 answer

How to calculate tf-idf?

I have a problem: I can't calculate the tf-idf with my current code. This is an example of tf-idf: $tfidf = $term_frequency * // tf log( $total_document_count / $documents_with_term, 2); // idf I have the total number of documents, but I need…
PSilvestre
  • 177
  • 2
  • 12
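
The formula quoted in the question above (term frequency times a base-2 log of total documents over documents containing the term) can be sketched end to end. This is written in Python rather than the question's PHP, and it assumes whitespace tokenization and raw counts as the term frequency. The function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log2(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    total_document_count = len(tokenized)

    # Number of documents that contain each term at least once.
    documents_with_term = Counter()
    for tokens in tokenized:
        documents_with_term.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_frequency = Counter(tokens)  # raw count of each term in this document
        scores.append({
            term: count * math.log2(total_document_count / documents_with_term[term])
            for term, count in term_frequency.items()
        })
    return scores

docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

The piece the question appears to be missing, the per-term document count, comes from adding each document's set of distinct terms to a counter, so a term is counted once per document no matter how often it repeats.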
0
votes
0 answers

Calculate DF using Lucene doesn't work

I have an index with 2 docs at the moment (I will add more once everything works). I'm trying to calculate the df for a specific term, but I always get the total number of docs in the index as the result. For debugging purposes I entered a unique…
user1864229
  • 53
  • 2
  • 4
  • 11
0
votes
2 answers

Best Feature Selection Algorithm For Document Classification

I am working on a document classification project. I am using tf-idf and centroid algorithms, but I need a dictionary to use those algorithms. I have tried information gain for making a dictionary, but I think the result is not satisfactory. Have you…
Yavuz
  • 1,257
  • 1
  • 16
  • 32
0
votes
1 answer

how to get solr termVectorComponent results using solrj

I am trying to write this query: localhost/solr/tvrh/?q=queryString&version=2.2&indent=on&tv.tf_idf=true using solrj. I want to get the tf and idf values below: test20508
yns
  • 440
  • 2
  • 8
  • 28
0
votes
1 answer

How to do K-means with normalized TF-IDF

I want some guidance here. I've just been trying to normalize the TF-IDF results for my project, and I'm thinking ahead to what comes next after TF-IDF. I wanted to do k-means clustering on those normalized TF-IDF values, but is it time for that already? Before this…
Dan
  • 810
  • 2
  • 11
  • 29
0
votes
0 answers

Whoosh for Non-Boolean Search Queries

I am building a question answering system, and to speed up the process I want an IR system to return a set of documents from a corpus likely to hold the answer to that question (and my NLP algorithm will try to figure out the answer from the full…
Chet
  • 21,375
  • 10
  • 40
  • 58
0
votes
1 answer

NLP - Improving Running Time and Recall of Fuzzy string matching

I have a working algorithm, but its running time is terrible. I knew from the start that it would be slow, but not this slow: for just 200,000 records, the program runs for more than an hour. Basically what I am doing is: for each…
0
votes
1 answer

java - how to implement Cosine Similarity with tf*idf score of the document?

I have a set of documents in which I am searching for my keyword. I have calculated the tf-idf values for the keyword and all the documents. Suppose I am storing my tf-idf values in an array for all the documents; how do I use it to calculate my…
Aravind Chinta
  • 71
  • 1
  • 4
  • 9
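
For the question above, cosine similarity is the dot product of two tf-idf vectors divided by the product of their Euclidean lengths. A minimal sketch follows, in Python rather than Java. It assumes the query and every document are already represented as equal-length lists of tf-idf weights over the same vocabulary. The vectors below are made-up numbers, not data from the question.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf weights over a three-term vocabulary.
query_vector = [0.0, 1.2, 0.5]
document_vectors = [
    [0.0, 2.4, 1.0],  # same direction as the query, twice the length
    [0.8, 0.0, 0.0],  # shares no terms with the query
]

similarities = [cosine_similarity(query_vector, d) for d in document_vectors]
print(similarities)
```

Ranking documents is then a matter of sorting them by this score in descending order.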
-1
votes
2 answers

Transition Probability Matrix calculation for sentences

I have sentences stored as strings extracted from a document. I want to apply standard cosine similarity to sentences. How do I go about doing it?
-1
votes
0 answers

Calling tfidf transform from IBM Watson Studio API

I have deployed an ML model on IBM Cloud. When calling its API through Flask, it only works for the .predict() method, which is basically fine for an ML model, but my problem is that the input data is text and it needs to be transformed with…
-1
votes
1 answer

Why, after TfidfVectorizer, do I get "X has 24 features" when PassiveAggressiveClassifier is expecting 113905 features as input?

I'm trying to use TfidfVectorizer on an array with one example and use it for model prediction, but after TfidfVectorizer I get: <1x24 sparse matrix of type '<class 'numpy.float64'>' with 24 stored elements in Compressed Sparse Row format> instead…
Bocley
  • 1
  • 2
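
A frequent cause of this kind of mismatch is fitting a fresh vectorizer on the new example instead of reusing the one fitted on the training data, because the vocabulary (and therefore the number of columns) is fixed at fit time. The sketch below illustrates the idea with a stdlib-only stand-in. `TinyCountVectorizer` is a hypothetical class written for this example, not scikit-learn's implementation, and the texts are made up. In scikit-learn terms, the equivalent fix is calling `transform` on the already fitted vectorizer rather than `fit_transform` on the new data.

```python
class TinyCountVectorizer:
    """Hypothetical stand-in for a vectorizer: the vocabulary is fixed at fit time."""

    def fit(self, texts):
        terms = sorted({t for text in texts for t in text.lower().split()})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            row = [0] * len(self.vocabulary_)
            for t in text.lower().split():
                index = self.vocabulary_.get(t)
                if index is not None:  # terms unseen during fit are dropped
                    row[index] += 1
            rows.append(row)
        return rows

training_texts = ["breaking news about markets", "sports news tonight"]
new_text = ["markets news"]

# Reusing the fitted vectorizer keeps the training width.
fitted = TinyCountVectorizer().fit(training_texts)
reused = fitted.transform(new_text)

# Refitting on the new text alone shrinks the vocabulary to that text.
refitted = TinyCountVectorizer().fit(new_text).transform(new_text)

print(len(reused[0]), len(refitted[0]))
```

A model trained on the wider matrix can only accept rows of that same width, which is why the fitted vectorizer usually has to be saved and loaded together with the classifier.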
-1
votes
1 answer

Python nested dictionary Issue when iterating

I have 5 lists of words, which basically act as values in a dictionary where the keys are the IDs of the documents. For each document, I would like to apply some calculations and display the values and results of the calculation in a nested…