Python tfidf returning same values regardless of idf

Question

I am trying to build a small program that calculates tf-idf in Python. I have used two very nice tutorials: most of the code below comes from one of them, and one function comes from a Kaggle tutorial.

import nltk
import string
import os
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords # Import the stop word list
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = 'my/path'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

def review_to_words( raw_review ):
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review).get_text() 
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words ))  



for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = os.path.join(subdir, file)
        # Use a context manager so each file handle is closed after reading
        with open(file_path, 'r') as shakes:
            text = shakes.read()
        token_dict[file] = review_to_words(text)

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())


str = 'this sentence has unseen text such as computer but also king  lord lord  this this and that lord juliet'#teststring
response = tfidf.transform([str])

feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print feature_names[col], ' - ', response[0, col]

The code seems to work fine, but then I looked at the results:

thi  -  0.612372435696
text  -  0.204124145232
sentenc  -  0.204124145232
lord  -  0.612372435696
king  -  0.204124145232
juliet  -  0.204124145232
ha  -  0.204124145232
comput  -  0.204124145232

The IDFs seem to be the same for all the words, because every TF-IDF value is just n * 0.204, where n is the number of times the word occurs. I checked tfidf.idf_ and this does seem to be the case.

Is there something in the method that I have not implemented correctly? Do you know why the idf_s are the same?

Examining your code I haven't found for certain what could be wrong. I did find something odd though. Why are you stripping stop words twice? Once in your `review_to_words()` function and also when you initialize `TfidfVectorizer`. — Phillip Martin, Apr 26 '16 at 17:06

score 1 · Answer 1 · answered Apr 27 '16 at 03:39

Since you provided a list containing 1 document, every term's idf will have the same 'binary frequency'.

idf is the inverse of a term's frequency over the set of documents (in other words, the inverse document frequency). Most, if not all, idf formulas only check whether a term is present in a document, so it does not matter how many times it appears within that document.

Try feeding a list with 3 distinct documents, for instance. That way the idfs will not all be the same.
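To make the suggestion concrete, here is a minimal sketch that fits a TfidfVectorizer once on a single document and once on three. The toy documents and variable names are hypothetical and not taken from the question's corpus, and the default tokenizer is used instead of the stemming one. The fitted terms are read from vocabulary_ so the snippet does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not the corpus from the question.
single_doc = ["the king spoke to the lord"]
three_docs = [
    "the king spoke to the lord",
    "the lord left the castle",
    "juliet spoke softly",
]

# One document: every term appears in 1 of 1 documents,
# so all idf values collapse to a single number.
one = TfidfVectorizer().fit(single_doc)
single_idfs = set(one.idf_.round(6))
print("distinct idf values with one document:", len(single_idfs))

# Three documents: document frequencies differ per term, and so do the idfs.
three = TfidfVectorizer().fit(three_docs)
idf_by_term = {term: three.idf_[idx] for term, idx in three.vocabulary_.items()}
for term, idf in sorted(idf_by_term.items()):
    print(term, "-", round(idf, 4))
```

In the three-document case, a term that occurs in only one document (such as "king") should get a higher idf than one that occurs in two (such as "lord").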

score 1 · Answer 2 · answered May 01 '16 at 14:26

The inverse document frequency of a term t is calculated as follows:

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$

where N is the total number of documents and df_t is the number of documents in which the term t appears.

In this case, your program has one document (the str variable). Therefore, both N and df_t equal 1. As a result, the IDF is the same for all terms.
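As a rough illustration of that argument, the sketch below computes idf by hand. It assumes the common textbook form idf_t = log(N / df_t) and plain whitespace tokenization; the helper idf_table and the sample documents are hypothetical. As far as I know, scikit-learn's default is a smoothed variant, ln((1 + N) / (1 + df_t)) + 1, so its numbers will differ, but the conclusion for a single document is the same.

```python
import math

def idf_table(documents):
    """Return {term: log(N / df_t)} for whitespace-tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() so a term counts once per document, however often it repeats
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# One document: N = 1 and df_t = 1 for every term, so every idf is log(1) = 0.
one_doc = idf_table(["lord lord lord king juliet"])
print(one_doc)

# Three documents: df_t now varies, so the idf values differ.
three_docs = idf_table(["lord king", "lord juliet", "lord computer king"])
print(three_docs)
```

Note that "lord" appears three times in the single document, yet its idf is no different from that of "king": only presence in a document counts, not how often the term repeats.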
