According to the documentation of scikit-learn's TfidfVectorizer (more precisely, of TfidfTransformer, which is used internally to transform the count matrix into a tf-idf representation), the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t.
Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ].
If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
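To make the difference between these variants concrete, here is a minimal sketch that evaluates them side by side using only the standard library. The function names are illustrative and not part of sklearn; the formulas are the ones quoted above, with the natural logarithm assumed.

```python
import math

def idf_textbook(n, df):
    # Plain textbook idf: log(n / df)
    return math.log(n / df)

def idf_textbook_plus_one(n, df):
    # Textbook variant with an adjusted denominator: log(n / (df + 1))
    return math.log(n / (df + 1))

def idf_sklearn(n, df, smooth_idf=True):
    # The two formulas quoted from the sklearn documentation
    if smooth_idf:
        return math.log((1 + n) / (1 + df)) + 1
    return math.log(n / df) + 1

n = 7  # corpus size, as in the question
for df in (1, 4, 7):
    print(
        df,
        round(idf_textbook(n, df), 4),
        round(idf_sklearn(n, df, smooth_idf=False), 4),
        round(idf_sklearn(n, df, smooth_idf=True), 4),
    )
```

For a term that appears in every document (df = n), the plain textbook idf is 0, so the term's tf-idf is 0 regardless of how often it occurs, while both sklearn variants give an idf of 1, so the term keeps its full term-frequency weight.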
In short, sklearn's TfidfVectorizer uses a different formula from the standard one, which is normally either idf(t) = log [ n / df(t) ] or idf(t) = log [ n / (df(t) + 1) ] (the denominator is adjusted to prevent division by zero when a term does not appear in the corpus). Additionally, the documentation states:
Tf is "n" (natural) by default
meaning that sklearn uses as tf the raw number of times a term t appears in a document, not the relative frequency, i.e., (number of times term t occurs in a document) / (number of terms in the document). Further, sklearn applies cosine (L2) normalisation to each document vector by default:
Normalization is “c” (cosine) when norm='l2'
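As a rough illustration of these two conventions, the sketch below computes raw and relative term frequencies for a single toy document and then L2-normalises both vectors. The helper names and the toy sentence are assumptions made for this example, and idf is left out to keep it short.

```python
import math
from collections import Counter

def raw_tf(tokens):
    # "Natural" tf: the raw number of occurrences of each term
    return dict(Counter(tokens))

def relative_tf(tokens):
    # Relative tf: occurrences divided by the number of tokens in the document
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}

def l2_normalise(weights):
    # Scale the vector to unit Euclidean length (cosine normalisation)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {t: w / norm for t, w in weights.items()}

tokens = "the peanuts were the small kind".split()  # toy document
raw = raw_tf(tokens)
rel = relative_tf(tokens)
unit_raw = l2_normalise(raw)
unit_rel = l2_normalise(rel)
print(raw["the"], round(rel["the"], 4))
```

Since the relative frequency is just the raw count divided by a per-document constant, L2-normalising either vector gives the same unit vector. The choice between raw and relative tf therefore only shows up in the final scores when normalisation is turned off (norm=None).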
For the reasons above, the results might differ from those obtained with the standard tf-idf formula. Additionally, when the corpus is very small, words that occur frequently across the corpus can be given a high tf-idf score, whereas it is words that are frequent in one document and rare in all the others that should score highest. It is very likely that if you remove the stop-words filter, i.e., TfidfVectorizer(stop_words='english'), you will even see stop words among the top-scoring words. This is despite tf-idf itself being commonly used to filter out stop words, since stop words are very common across a corpus and are therefore usually given very low scores. (On a side note, stop words might be noise for one dataset/domain but highly informative features for another, so whether to remove them should be decided through experimentation and analysis of the results. On the other hand, if you generate word bigrams and/or trigrams in addition to single terms, eliminating stop words would allow them to match better.)
As mentioned above, this occurs when the corpus (the collection of documents) is rather small (seven documents, in your case). In this case, as explained here, it is very likely that several words appear in every document of the corpus and are therefore penalised in the same way, so that frequently occurring words end up with a higher tf-idf score due to their higher term frequency. If, for example, the word "customer" appears in the same number of documents as "people" in your corpus, their idf values will be the same; however, frequently occurring words (such as stop words, if not eliminated, or "people" in your example) will, due to their higher term frequency tf, be given higher tf-idf scores than words such as "customer", which might also appear in every document (as an example) but with a lower term frequency. To demonstrate this, see the example below using sklearn's TfidfVectorizer (the stop-words filter was left out on purpose). The sample data used below come from here. The function for returning the highest-scoring words is based on this article (which I'd recommend having a look at).
Using sklearn's TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
df = pd.read_csv("Reviews.csv", usecols = ['Text'])
train = df.Text[:7]
#tfidf = TfidfVectorizer(analyzer= 'word', stop_words= 'english')
tfidf = TfidfVectorizer(analyzer= 'word')
Xtr = tfidf.fit_transform(train)
features = tfidf.get_feature_names_out()
# Get top n tfidf values in row and return them with their corresponding feature names
def top_tfidf_feats(Xtr, features, row_id, top_n=10):
    row = np.squeeze(Xtr[row_id].toarray())  # convert the row into dense format first
    # indices that order the row by tf-idf value, reversed into descending order, truncated to top_n
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(data=top_feats, columns=['feature', 'tfidf'])
    return df
top_feats_D1 = top_tfidf_feats(Xtr, features, 0)
print("Top features in D1\n", top_feats_D1, '\n')
top_feats_D2 = top_tfidf_feats(Xtr, features, 1)
print("Top features in D2\n", top_feats_D2, '\n')
top_feats_D3 = top_tfidf_feats(Xtr, features, 2)
print("Top features in D3\n", top_feats_D3, '\n')
The results derived from the above are compared against those derived from the standard tf-idf formula, using three different train set (corpus) sizes (i.e., n=7, n=100 and n=1000). Below is the code for computing the tf-idf scores using the standard formula:
Using the standard tf-idf formula
import math
from nltk.tokenize import word_tokenize

def tf(term, doc):
    # relative term frequency: occurrences of term / number of tokens in doc
    tokens = [tok.lower() for tok in word_tokenize(doc)]
    return tokens.count(term) / len(tokens)

def dft(term, corpus):
    # document frequency: number of documents in corpus that contain term
    return sum(1 for doc in corpus if term in [tok.lower() for tok in word_tokenize(doc)])

def idf(term, corpus):
    # standard idf: log(n / df(t))
    return math.log(len(corpus) / dft(term, corpus))

# note: this rebinds the name tfidf, which held the vectorizer in the sklearn snippet
def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for i, doc in enumerate(train):
    if i == 3:  # print results for the first 3 documents only
        break
    print("Top features in D{}".format(i + 1))
    scores = {term.lower(): tfidf(term.lower(), doc, train) for term in word_tokenize(doc) if term.isalpha()}
    sorted_terms = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    df_top_feats = pd.DataFrame()
    idx = 0
    for term, score in sorted_terms[:10]:
        df_top_feats.loc[idx, 'feature'] = term
        df_top_feats.loc[idx, 'tfidf'] = round(score, 5)
        idx += 1
    print(df_top_feats, '\n')
The results below speak for themselves. When only seven documents are used, several stop words are among the highest-scoring words (only the first three documents are shown below). As the number of documents increases, overly common words (across documents) drop out and others take their place. Interestingly, as can be seen below, the standard tf-idf formula does a better job of eliminating frequently occurring terms, even when the corpus is relatively small (i.e., n=7).
Therefore, you can solve the problem by implementing your own function (as demonstrated above) for calculating the tf-idf score using the standard formula (and see how that works for you), and/or by increasing the size of your corpus (in terms of documents). You could also try disabling smoothing and/or normalisation, i.e., TfidfVectorizer(smooth_idf=False, norm=None); however, the results might not be that different from the ones you currently obtain.
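As a quick check of why disabling smoothing and normalisation may not change much, here is a small sketch on a made-up three-document corpus (the sentences are assumptions for illustration, not the review data). It compares the fitted idf_ values with log [ n / df(t) ] + 1 and looks at a word that occurs in every document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus: "the" appears in every document, three times in the first one
docs = [
    "the peanuts in the bag were the small kind",
    "the dog likes the stew",
    "the product arrived late",
]

vec = TfidfVectorizer(smooth_idf=False, norm=None)
X = vec.fit_transform(docs)

# Recompute the unsmoothed idf by hand from the raw counts
counts = CountVectorizer().fit_transform(docs).toarray()
n = counts.shape[0]
df = (counts > 0).sum(axis=0)
expected_idf = np.log(n / df) + 1

idf_matches = np.allclose(vec.idf_, expected_idf)
scores_match = np.allclose(X.toarray(), counts * expected_idf)

the_score = X[0, vec.vocabulary_["the"]]
peanuts_score = X[0, vec.vocabulary_["peanuts"]]
print(idf_matches, scores_match, the_score > peanuts_score)
```

Even with smooth_idf=False and norm=None, the "+ 1" in sklearn's idf means a word present in every document keeps an idf of 1 rather than 0, so in this toy corpus "the" still outscores "peanuts" in the first document purely because of its higher count.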
Results:
train = df.Text[:7]                                 train = df.Text[:100]                               train = df.Text[:1000]
Sklearn Tf-Idf            Standard Tf-Idf           Sklearn Tf-Idf            Standard Tf-Idf           Sklearn Tf-Idf            Standard Tf-Idf

Top features in D1        Top features in D1        Top features in D1        Top features in D1        Top features in D1        Top features in D1
  feature      tfidf        feature      tfidf        feature      tfidf        feature      tfidf        feature      tfidf        feature      tfidf
0 than         0.301190   0 than         0.07631    0 better       0.275877   0 vitality     0.0903     0 vitality     0.263274   0 vitality     0.13545
1 better       0.301190   1 better       0.07631    1 than         0.243747   1 canned       0.0903     1 appreciates  0.263274   1 labrador     0.13545
2 product      0.250014   2 have         0.04913    2 product      0.229011   2 looks        0.0903     2 labrador     0.263274   2 appreciates  0.13545
3 have         0.250014   3 product      0.04913    3 vitality     0.211030   3 stew         0.0903     3 stew         0.248480   3 stew         0.12186
4 and          0.243790   4 bought       0.03816    4 labrador     0.211030   4 processed    0.0903     4 finicky      0.248480   4 finicky      0.12186
5 of           0.162527   5 several      0.03816    5 stew         0.211030   5 meat         0.0903     5 better       0.238212   5 processed    0.10826
6 quality      0.150595   6 vitality     0.03816    6 looks        0.211030   6 better       0.0903     6 processed    0.229842   6 canned       0.10031
7 meat         0.150595   7 canned       0.03816    7 meat         0.211030   7 labrador     0.0903     7 canned       0.217565   7 smells       0.10031
8 products     0.150595   8 dog          0.03816    8 processed    0.211030   8 finicky      0.0903     8 smells       0.217565   8 meat         0.09030
9 bought       0.150595   9 food         0.03816    9 finicky      0.211030   9 appreciates  0.0903     9 than         0.201924   9 better       0.08952

Top features in D2        Top features in D2        Top features in D2        Top features in D2        Top features in D2        Top features in D2
  feature      tfidf        feature      tfidf        feature      tfidf        feature      tfidf        feature      tfidf        feature      tfidf
0 jumbo        0.341277   0 as           0.10518    0 jumbo        0.411192   0 jumbo        0.24893    0 jumbo        0.491636   0 jumbo        0.37339
1 peanuts      0.341277   1 jumbo        0.10518    1 peanuts      0.377318   1 peanuts      0.21146    1 peanuts      0.389155   1 peanuts      0.26099
2 as           0.341277   2 peanuts      0.10518    2 if           0.232406   2 labeled      0.12446    2 represent    0.245818   2 intended     0.18670
3 product      0.283289   3 product      0.06772    3 product      0.223114   3 salted       0.12446    3 intended     0.245818   3 represent    0.18670
4 the          0.243169   4 arrived      0.05259    4 as           0.214753   4 unsalted     0.12446    4 error        0.232005   4 labeled      0.16796
5 if           0.210233   5 labeled      0.05259    5 salted       0.205596   5 error        0.12446    5 labeled      0.232005   5 error        0.16796
6 actually     0.170638   6 salted       0.05259    6 intended     0.205596   6 vendor       0.12446    6 vendor       0.208391   6 vendor       0.14320
7 sure         0.170638   7 actually     0.05259    7 vendor       0.205596   7 intended     0.12446    7 unsalted     0.198590   7 unsalted     0.13410
8 small        0.170638   8 small        0.05259    8 labeled      0.205596   8 represent    0.12446    8 product      0.186960   8 salted       0.12446
9 sized        0.170638   9 sized        0.05259    9 unsalted     0.205596   9 product      0.10628    9 salted       0.184777   9 sized        0.11954

Top features in D3        Top features in D3        Top features in D3        Top features in D3        Top features in D3        Top features in D3
  feature      tfidf        feature      tfidf        feature      tfidf        feature      tfidf        feature      tfidf        feature      tfidf
0 and          0.325182   0 that         0.03570    0 witch        0.261635   0 witch        0.08450    0 witch        0.311210   0 witch        0.12675
1 the          0.286254   1 into         0.03570    1 tiny         0.240082   1 tiny         0.07178    1 tiny         0.224307   1 tiny         0.07832
2 is           0.270985   2 tiny         0.03570    2 treat        0.224790   2 treat        0.06434    2 treat        0.205872   2 treat        0.07089
3 with         0.250113   3 witch        0.03570    3 into         0.203237   3 into         0.05497    3 into         0.192997   3 into         0.06434
4 that         0.200873   4 with         0.03448    4 the          0.200679   4 confection   0.04225    4 is           0.165928   4 confection   0.06337
5 into         0.200873   5 treat        0.02299    5 is           0.195614   5 centuries    0.04225    5 and          0.156625   5 centuries    0.06337
6 witch        0.200873   6 and          0.01852    6 and          0.183265   6 light        0.04225    6 lion         0.155605   6 pillowy      0.06337
7 tiny         0.200873   7 confection   0.01785    7 with         0.161989   7 pillowy      0.04225    7 edmund       0.155605   7 gelatin      0.06337
8 this         0.168355   8 has          0.01785    8 this         0.154817   8 citrus       0.04225    8 seduces      0.155605   8 filberts     0.06337
9 treat        0.166742   9 been         0.01785    9 pillowy      0.130818   9 gelatin      0.04225    9 filberts     0.155605   9 liberally    0.06337