  • I have a list of length 7 (7 subjects).
  • Each element in the list contains a long string of words.
  • Each element of the list can be viewed as a topic, with a long sentence that sets it apart.
  • I want to check which words make each topic (each element in the list) unique.

Here's my code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

train = read_train_file() # A list with huge sentences that I can't paste here

tfidfvectorizer = TfidfVectorizer(analyzer= 'word', stop_words= 'english')
tfidf_wm        = tfidfvectorizer.fit_transform(train)
tfidf_tokens    = tfidfvectorizer.get_feature_names()

df_tfidfvect = pd.DataFrame(data = tfidf_wm.toarray(), index=train_df.discourse_type.unique(), columns = tfidf_tokens)


for col in df_tfidfvect.T.columns:    
    print(f"\nsubjetct: {col}")
    print(df_tfidfvect.T[col].nlargest(2))

output:

subject: Position
people    0.316126
school    0.211516
Name: Position, dtype: float64

subject: Claim
people    0.354722
school    0.296632
Name: Claim, dtype: float64

subject: Evidence
people    0.366234
school    0.282213
Name: Evidence, dtype: float64

subject: Concluding Statement
people    0.385200
help      0.267567
Name: Concluding Statement, dtype: float64

subject: Lead
people    0.399011
school    0.336605
Name: Lead, dtype: float64

subject: Counterclaim
people       0.361070
electoral    0.321909
Name: Counterclaim, dtype: float64

subject: Rebuttal
people    0.31029
school    0.26789
Name: Rebuttal, dtype: float64

As you can see, "people" and "school" have high tf-idf values in every topic.

Maybe I'm wrong, but I was expecting that the words which characterize a topic would not be the same words across all topics (according to the TF-IDF formula).

Part of train data:

for i, v in enumerate(train):
    print(f"subject: {i}: {train[i][:50]}")

subject: 0: like policy people average cant play sports b poin
subject: 1: also stupid idea sports suppose fun privilege play
subject: 2: failing fail class see act higher c person could g
subject: 3: unfair rule thought think new thing shaped land fo
subject: 4: land form found human thought many either fight de
subject: 5: want say know trying keep class also quite expensi
subject: 6: even less sense saying first find something really

So what is wrong with TfidfVectorizer?


1 Answer


As per scikit-learn/sklearn's TfidfVectorizer documentation (actually TfidfTransformer, which is used internally to transform a count matrix into a tf-idf representation), the idf formula:

is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t.

Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ].

If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
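
As a quick sanity check, here is a minimal sketch (on a hypothetical two-document toy corpus) comparing the fitted vectorizer's idf_ attribute against the smoothed formula quoted above:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["red apple", "red orange"]        # toy corpus, n = 2 documents
vec = TfidfVectorizer().fit(docs)

n = len(docs)
df_t = np.array([1, 1, 2])                # document frequencies of 'apple', 'orange', 'red' (alphabetical vocabulary order)
print(vec.idf_)                           # idf values learned by sklearn
print(np.log((1 + n) / (1 + df_t)) + 1)   # idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1 -> same values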

In short, sklearn's TfidfVectorizer uses a different formula from the standard one, which is normally either idf(t) = log [ n / df(t) ] or idf(t) = log [ n / (df(t) + 1) ] (the denominator is adjusted to prevent zero divisions when a term is not in the corpus). Additionally:

Tf is "n" (natural) by default

meaning that sklearn uses as tf the raw number of times a term 't' appears in a document, not the relative frequency, i.e., (number of times term 't' occurs in a document) / (number of terms in the document). Further, sklearn applies cosine normalisation:

Normalization is “c” (cosine) when norm='l2'
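
Putting these three points together, the following sketch (again on a hypothetical toy corpus) reproduces sklearn's output by hand: raw term counts, the smoothed idf from above, and l2 ("cosine") normalisation:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

docs = ["the cat sat on the mat", "the dog sat on the log"]

counts = CountVectorizer().fit_transform(docs).toarray()  # raw tf: term counts per document
n = counts.shape[0]
df_t = (counts > 0).sum(axis=0)                           # document frequency of each term
idf = np.log((1 + n) / (1 + df_t)) + 1                    # smooth_idf=True formula
manual = counts * idf                                     # tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # l2 ("cosine") normalisation

sk = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, sk))                            # expected: True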

For the reasons above, the results can differ from those of the standard tf-idf formula. Additionally, when the corpus is very small, words that occur frequently across the corpus will be given high tf-idf scores, whereas words that are frequent in one document but rare in every other document should be the ones given high scores. It is very likely that, if you removed the stop-word filter from TfidfVectorizer(stop_words='english'), you would even see stop words among the top-scoring words; yet tf-idf is also known for being usable to remove stop words, since stop words are very common terms across a corpus and are thus usually given very low scores. (On a side note, stop words might be considered noise for one dataset/domain but highly informative features for another, so removing them or not should be based on experimentation and results analysis. On the other hand, if one also generates word bigrams and/or trigrams besides single terms, eliminating stop words would allow them to match better.)

As mentioned above, this occurs when the corpus (collection of documents) is rather small (seven documents, in your case). In this case, as explained here, it is very likely that several words appearing in every document of the corpus are penalised in the same way, so that frequently occurring words end up with higher tf-idf scores purely because of their higher term frequency. If, for example, the word "customer" appears in the same number of documents as "people", their idf values will be identical; however, frequently occurring words (such as stop words, if not eliminated, or "people" in your example) will, due to their higher term frequency tf, be given higher tf-idf scores than words such as "customer", even though "customer" might appear in every document as well, just with a lower term frequency. To demonstrate this, see below using sklearn's TfidfVectorizer (the stop-word filter was left out on purpose). The sample data used below come from here. The function for returning the highest-scoring words is based on this article (which I'd recommend having a look at).
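
To make the "customer" vs. "people" point concrete before the larger demonstration, here is a tiny sketch on a hypothetical corpus in which both words appear in every document (so their idf values are identical), but "people" has the higher term frequency and therefore the higher tf-idf score:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

toy = [
    "people people people customer school",
    "people people customer election",
    "people people people people customer help",
]
vec = TfidfVectorizer()
X = vec.fit_transform(toy)
scores = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(scores.round(3))  # "people" outscores "customer" in every row despite the identical document frequency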

Using sklearn's TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

df = pd.read_csv("Reviews.csv", usecols = ['Text'])
train = df.Text[:7]

#tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
tfidf = TfidfVectorizer(analyzer='word')

Xtr = tfidf.fit_transform(train)
features = tfidf.get_feature_names_out()

# Get the top n tf-idf values in a row and return them with their corresponding feature names
def top_tfidf_feats(Xtr, features, row_id, top_n=10):
    row = np.squeeze(Xtr[row_id].toarray())  # convert the row into dense format first
    topn_ids = np.argsort(row)[::-1][:top_n] # indices that order the row by tf-idf value, reversed (descending), top_n selected
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(data=top_feats, columns=['feature', 'tfidf'])
    return df

top_feats_D1 = top_tfidf_feats(Xtr, features, 0)
print("Top features in D1\n", top_feats_D1, '\n')

top_feats_D2 = top_tfidf_feats(Xtr, features, 1)
print("Top features in D2\n", top_feats_D2, '\n')

top_feats_D3 = top_tfidf_feats(Xtr, features, 2)
print("Top features in D3\n", top_feats_D3, '\n')

The results of the above are compared against those obtained with the standard tf-idf formula, for three different train set (corpus) sizes (i.e., n=7, n=100 and n=1000). Below is the code for computing the tf-idf scores using the standard formula:

Using the standard tf-idf formula

import math
from nltk.tokenize import word_tokenize

def tf(term, doc):
    terms = [t.lower() for t in word_tokenize(doc)]  # lowercased tokens of the document
    return terms.count(term) / len(terms)            # relative term frequency

def dft(term, corpus):
    # number of documents in the corpus that contain the term
    return sum(1 for doc in corpus if term in [t.lower() for t in word_tokenize(doc)])

def idf(term, corpus):
    return math.log(len(corpus) / dft(term, corpus))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for i, doc in enumerate(train):
    if i == 3: # print results for the first 3 documents only
        break
    print("Top features in D{}".format(i + 1))
    scores = {term.lower(): tfidf(term.lower(), doc, train) for term in word_tokenize(doc) if term.isalpha()} 
    sorted_terms = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    df_top_feats = pd.DataFrame()
    idx = 0
    for term, score in sorted_terms[:10]:
        df_top_feats.loc[idx, 'feature'] = term
        df_top_feats.loc[idx, 'tfidf'] = round(score, 5)
        idx+=1
    print(df_top_feats, '\n')

The results below speak for themselves. When only seven documents are used, several stop words appear among the highest-scoring words (only the first three documents are shown below). As the number of documents increases, overly common words (across documents) are eliminated and others take their place. Interestingly, as can be seen below, the standard tf-idf formula does a better job of eliminating frequently occurring terms, even when the corpus size is relatively small (i.e., n=7).

Therefore, you can address the problem either by implementing your own function (as demonstrated above) for calculating the tf-idf score using the standard formula (and seeing how that works for you), and/or by increasing the size of your corpus (in terms of documents). You could also try disabling smoothing and/or normalisation in TfidfVectorizer(smooth_idf=False, norm=None); however, the results might not be very different from the ones you currently obtain.
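
For completeness, a quick sketch of that last variant (reusing the train list from your question), so that scores follow idf(t) = log [ n / df(t) ] + 1 and are left unnormalised:

from sklearn.feature_extraction.text import TfidfVectorizer

raw_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                 smooth_idf=False, norm=None)
raw_tfidf = raw_vectorizer.fit_transform(train)  # `train` as defined in your question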

Results:

            train = df.Text[:7]                                  train = df.Text[:100]                                   train = df.Text[:1000]
     Sklearn Tf-Idf         Standard Tf-Idf             Sklearn Tf-Idf            Standard Tf-Idf                Sklearn Tf-Idf              Standard Tf-Idf

  Top features in D1      Top features in D1          Top features in D1         Top features in D1            Top features in D1           Top features in D1
    feature     tfidf       feature    tfidf             feature     tfidf            feature   tfidf               feature     tfidf            feature    tfidf
0      than  0.301190   0      than  0.07631        0     better  0.275877     0     vitality  0.0903        0     vitality  0.263274     0     vitality  0.13545
1    better  0.301190   1    better  0.07631        1       than  0.243747     1       canned  0.0903        1  appreciates  0.263274     1     labrador  0.13545
2   product  0.250014   2      have  0.04913        2    product  0.229011     2        looks  0.0903        2     labrador  0.263274     2  appreciates  0.13545
3      have  0.250014   3   product  0.04913        3   vitality  0.211030     3         stew  0.0903        3         stew  0.248480     3         stew  0.12186
4       and  0.243790   4    bought  0.03816        4   labrador  0.211030     4    processed  0.0903        4      finicky  0.248480     4      finicky  0.12186
5        of  0.162527   5   several  0.03816        5       stew  0.211030     5         meat  0.0903        5       better  0.238212     5    processed  0.10826
6   quality  0.150595   6  vitality  0.03816        6      looks  0.211030     6       better  0.0903        6    processed  0.229842     6       canned  0.10031
7      meat  0.150595   7    canned  0.03816        7       meat  0.211030     7     labrador  0.0903        7       canned  0.217565     7       smells  0.10031
8  products  0.150595   8       dog  0.03816        8  processed  0.211030     8      finicky  0.0903        8       smells  0.217565     8         meat  0.09030
9    bought  0.150595   9      food  0.03816        9    finicky  0.211030     9  appreciates  0.0903        9         than  0.201924     9       better  0.08952
                                                                                                                                          
  Top features in D2      Top features in D2          Top features in D2         Top features in D2            Top features in D2           Top features in D2
    feature     tfidf       feature    tfidf             feature    tfidf           feature    tfidf              feature     tfidf            feature    tfidf
0     jumbo  0.341277   0        as  0.10518        0     jumbo  0.411192      0      jumbo  0.24893         0      jumbo  0.491636       0      jumbo  0.37339
1   peanuts  0.341277   1     jumbo  0.10518        1   peanuts  0.377318      1    peanuts  0.21146         1    peanuts  0.389155       1    peanuts  0.26099
2        as  0.341277   2   peanuts  0.10518        2        if  0.232406      2    labeled  0.12446         2  represent  0.245818       2   intended  0.18670
3   product  0.283289   3   product  0.06772        3   product  0.223114      3     salted  0.12446         3   intended  0.245818       3  represent  0.18670
4       the  0.243169   4   arrived  0.05259        4        as  0.214753      4   unsalted  0.12446         4      error  0.232005       4    labeled  0.16796
5        if  0.210233   5   labeled  0.05259        5    salted  0.205596      5      error  0.12446         5    labeled  0.232005       5      error  0.16796
6  actually  0.170638   6    salted  0.05259        6  intended  0.205596      6     vendor  0.12446         6     vendor  0.208391       6     vendor  0.14320
7      sure  0.170638   7  actually  0.05259        7    vendor  0.205596      7   intended  0.12446         7   unsalted  0.198590       7   unsalted  0.13410
8     small  0.170638   8     small  0.05259        8   labeled  0.205596      8  represent  0.12446         8    product  0.186960       8     salted  0.12446
9     sized  0.170638   9     sized  0.05259        9  unsalted  0.205596      9    product  0.10628         9     salted  0.184777       9      sized  0.11954 
                                                                                                                                          
  Top features in D3      Top features in D3          Top features in D3         Top features in D3            Top features in D3           Top features in D3
   feature    tfidf            feature    tfidf          feature    tfidf             feature    tfidf             feature    tfidf             feature     tfidf
0     and  0.325182     0        that  0.03570      0    witch  0.261635       0       witch  0.08450        0     witch  0.311210        0       witch  0.12675
1     the  0.286254     1        into  0.03570      1     tiny  0.240082       1        tiny  0.07178        1      tiny  0.224307        1        tiny  0.07832
2      is  0.270985     2        tiny  0.03570      2    treat  0.224790       2       treat  0.06434        2     treat  0.205872        2       treat  0.07089
3    with  0.250113     3       witch  0.03570      3     into  0.203237       3        into  0.05497        3      into  0.192997        3        into  0.06434
4    that  0.200873     4        with  0.03448      4      the  0.200679       4  confection  0.04225        4        is  0.165928        4  confection  0.06337
5    into  0.200873     5       treat  0.02299      5       is  0.195614       5   centuries  0.04225        5       and  0.156625        5   centuries  0.06337
6   witch  0.200873     6         and  0.01852      6      and  0.183265       6       light  0.04225        6      lion  0.155605        6     pillowy  0.06337
7    tiny  0.200873     7  confection  0.01785      7     with  0.161989       7     pillowy  0.04225        7    edmund  0.155605        7     gelatin  0.06337
8    this  0.168355     8         has  0.01785      8     this  0.154817       8      citrus  0.04225        8   seduces  0.155605        8    filberts  0.06337
9   treat  0.166742     9        been  0.01785      9  pillowy  0.130818       9     gelatin  0.04225        9  filberts  0.155605        9   liberally  0.06337 