Questions tagged [tf-idf]

“Term frequency × inverse document frequency”, or “tf-idf”, is a weighting used in natural language processing and information retrieval that measures how important a word is to a document in a collection or corpus.
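
As a rough illustration of the definition above, one common textbook form of the weighting is written out below. The symbols are assumptions introduced here, not taken from the tag description: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$. Libraries often use smoothed or normalized variants, so their numbers may differ from this sketch.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A term that appears often in one document but in few documents overall gets a high weight; a term that appears in every document gets $\mathrm{idf}(t) = \log 1 = 0$.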

1326 questions
-1
votes
1 answer

Reverse TF-IDF vector (vec2text)

Given a doc2vec vector generated from some document, is it possible to reverse the vector back to the original document? If so, is there any hash algorithm that would make the vector irreversible but still comparable to other vectors of the…
-1
votes
1 answer

NotFittedError: The TF-IDF vectorizer is not fitted

I've trained a sentiment analysis classifier using TripAdvisor's textual reviews datasets. It can predict the input textual reviews' rating based on sentiment. Everything is ok with the training and testing. However, when I loaded the classifier in…
-1
votes
1 answer

Text transform with sklearn TF-IDF vectorizer generates too big csv file

I have 1000 texts, each with 200-1000 words. The text CSV file is about 10 MB. When I vectorize them with this code, the output CSV is exceptionally big (2.5 GB). I am not sure what I did wrong. Your help is highly appreciated.…
-1
votes
1 answer

cosine_sim between a text and a single column in a dataset

I have a dataset that I had to lemmatize, which I did below. Now I have to find the similarity between one column, "text", and the phrase "vaccine is deadly", but I am not sure how to use the cosine similarity function correctly. I tried putting the…
-1
votes
1 answer

How to use TfidfVectorizer if I already have a list of keywords in a python df? What are the correct inputs?

I want to calculate the TF-IDF of keywords for a given genre. These keywords were never part of a text, they were already separated but in a different format. I extracted them from that format and put them into lists. The same with genres I had a df…
-1
votes
1 answer

How to apply tf-idf to rows of text

I have rows of blurbs (in text format) and I want to use tf-idf to define the weight of each word. Below is the code: def remove_punctuations(text): for punctuation in string.punctuation: text = text.replace(punctuation, '') return…
-1
votes
4 answers

Count frequency of a string individually from query

I want to search for a query in a file named a.java. If my query is String name, I want to get the frequency of each word of the query in the text file. First I have to count the frequency of String and then name individually and…
-1
votes
1 answer

My nested for loops are taking so much time while calculating term-frequency

I have a list "total_vocabulary" with all the unique words in a collection of 56 documents. There is another list of lists, "rest_doc", holding the words of every document. I want to calculate the term frequency of each word from "total_vocabulary" in "rest_doc"…
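
The excerpt above is cut off, so the asker's actual code is not visible. As a hedged sketch of one way to avoid nested loops over vocabulary and documents, each document can be counted once with `collections.Counter` and then looked up per word. The toy contents of `rest_doc` are invented for illustration; only the two variable names come from the question.

```python
from collections import Counter

# Hypothetical stand-in for the question's data: one token list per document.
rest_doc = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
]
# Unique words across all documents, sorted for a stable order.
total_vocabulary = sorted({word for doc in rest_doc for word in doc})

# Count each document once; a Counter returns 0 for words a document lacks.
doc_counts = [Counter(doc) for doc in rest_doc]

# term_frequency[word][i] is the raw count of `word` in document i.
term_frequency = {
    word: [counts[word] for counts in doc_counts]
    for word in total_vocabulary
}
```

Each document is scanned once to build its counter, and every lookup afterwards is a dictionary access rather than a pass over the document.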
-1
votes
1 answer

KNN for text classification, but train and class have different lengths in R

Hello I am trying to classify text, here is the code df <- read.csv("D:/AS/tokpedprepro.csv") #sampling set.seed(123) df <- df[sample(nrow(df)),] df <- df[sample(nrow(df)),] #Convert to corpus dfCorpus <-…
-1
votes
1 answer

TF-IDF Vectors Example (HELP)

Hey, I tried 3 different approaches but I can't decide which is the right way to use TF-IDF: The first code fits and transforms x_train and x_test separately, giving (5000, 94462) (5000, 93007). The second code uses both train and test which…
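
The mismatched column counts in the excerpt above (94462 versus 93007) are what happens when two vocabularies are learned independently. A minimal sketch of the usual pattern with scikit-learn's `TfidfVectorizer` follows: fit on the training texts only, then reuse the fitted vectorizer on the test texts. The toy sentences are invented; `x_train` and `x_test` are the names used in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the question's x_train and x_test.
x_train = ["good movie", "bad movie", "great plot"]
x_test = ["good plot", "unseen words here"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and idf weights from the training texts only.
X_train = vectorizer.fit_transform(x_train)
# Reuse them on the test texts; words not seen in training are ignored.
X_test = vectorizer.transform(x_test)
```

Both matrices then share the same columns, so a model trained on `X_train` can score `X_test` directly.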
-1
votes
1 answer

N-gram frequency in Python with NLTK

I want to write a function that returns the frequency of each element in the n-gram of a given text. Help please. I wrote this code for counting the frequency of 2-grams: from nltk import FreqDist from nltk.util import ngrams def…
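
The asker's NLTK code is truncated above, so the following is not a reconstruction of it. It is a small standard-library sketch of the same idea, counting n-grams with `collections.Counter`. The function name `ngram_frequencies` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def ngram_frequencies(text, n=2):
    """Return a Counter mapping each n-gram (a tuple of tokens) to its count."""
    # Assumption: lowercase, whitespace-separated tokens are good enough here.
    tokens = text.lower().split()
    # Zipping n shifted views of the token list yields consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


freqs = ngram_frequencies("the cat sat on the cat", n=2)
```

Here `freqs[("the", "cat")]` is 2, since that bigram occurs twice in the sample sentence.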
-1
votes
1 answer

How to fix 'int' object is not iterable in TF-IDF freqDict_list

I'm currently coding a TF-IDF program in python. I followed a code from this, however it's not working. The problem is 'int' object is not iterable. Traceback (most recent call last): File "C:/Users/Try Arie/PycharmProjects/TF-IDF/tf-idf.py", line…
-1
votes
1 answer

term frequency calculation using python

Finding term frequency for documents in a list using python l=['cat sat besides dog'] I have tried finding the term frequency for each word in the corpus. term freq=(no of times word occurred in document/total number of words in a document). I tried…
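
The formula quoted in the excerpt above (occurrences of a word divided by the total number of words in the document) can be sketched as follows. The helper name `term_frequencies` is hypothetical and splitting on whitespace is an assumption; the list `l` is the one given in the question.

```python
from collections import Counter


def term_frequencies(document):
    """Map each word to (occurrences in document) / (total words in document)."""
    tokens = document.split()  # assumption: whitespace tokenization
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


l = ['cat sat besides dog']
tf_per_document = [term_frequencies(doc) for doc in l]
```

For the single four-word document in `l`, every word occurs once, so each term frequency is 0.25.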
-1
votes
1 answer

What dimension reduction techniques can I try on my data (0-1 features + tf-idf scores as features) before feeding it into an SVM?

I have about 8000 features measuring a two level response variable i.e. output can belong to class 1 or 0. The 8000 features consist of about 3000 features with 0-1 values and about 5000 features (which are basically words from text data and their…
-1
votes
1 answer

Is there a way of removing all the words in the text that are not in other text?

I have a document with many reviews. I am creating a bag-of-words BW using TfidfVectorizer. What I want to do is: I only want to use words in BW that are also in other document D. The document D is a document with positive words. I am using this…
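
The excerpt above is truncated, so what the asker tried is unknown. One possible approach, sketched here under that caveat, is to pass the words of D to `TfidfVectorizer` through its `vocabulary` parameter, which restricts the features to exactly those terms. The review texts and the contents of `positive_words` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical reviews and a hypothetical list of words taken from document D.
reviews = ["great food but slow service", "good food and great staff"]
positive_words = ["good", "great"]

# With a fixed vocabulary, only these terms become columns; all others are dropped.
vectorizer = TfidfVectorizer(vocabulary=positive_words)
X = vectorizer.fit_transform(reviews)
```

The resulting matrix has one column per word in `positive_words`, in the order given, and a review containing none of them becomes an all-zero row.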