
I want to calculate the TF-IDF of keywords for a given genre. These keywords were never part of a text; they were already separated, but stored in a different format. I extracted them from that format and put them into lists, and did the same with the genres.

I had a df in this format:

```
keywords,genres
['k1','k2','k3'],['g1','g2']
['k2','k5','k7'],['g1','g3']
['k1','k2','k9'],['g4']
['k6','k7','k8'],['g3','g5']
...
```

I used explode on the genres col and got:

```
['k1','k2','k3'],g1
['k1','k2','k3'],g2
['k2','k5','k7'],g1
['k2','k5','k7'],g3
['k1','k2','k9'],g4
['k6','k7','k8'],g3
['k6','k7','k8'],g5
...
```

Then I grouped by genre to get this df_agg:

```
genres,keywords
g1,['k1','k2','k3','k2','k5','k7']
g2,['k1','k2','k3']
g3,['k2','k5','k7','k6','k7','k8']
g4,['k1','k2','k9']
g5,['k6','k7','k8']
...
```
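The explode and group-by steps described above can be sketched roughly as follows. This is a minimal reconstruction, assuming the sample rows shown in the question; the name `exploded` and the flattening lambda are illustrative choices, not necessarily the code the asker ran.

```python
import pandas as pd

# Toy data mirroring the layout shown in the question.
df = pd.DataFrame({
    "keywords": [["k1", "k2", "k3"], ["k2", "k5", "k7"],
                 ["k1", "k2", "k9"], ["k6", "k7", "k8"]],
    "genres": [["g1", "g2"], ["g1", "g3"], ["g4"], ["g3", "g5"]],
})

# One row per (keyword list, genre) pair.
exploded = df.explode("genres")

# Concatenate the keyword lists of all rows that share a genre.
df_agg = (
    exploded.groupby("genres")["keywords"]
    .agg(lambda lists: [kw for kws in lists for kw in kws])
    .reset_index()
)
print(df_agg)
```

Repeated keywords are kept on purpose (for example `k2` appears twice under `g1`), since the term frequency depends on them.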

I made these changes so that I can calculate the TF-IDF of the keywords per genre, but I'm not sure this is the correct format: df_agg['keywords'] holds lists, whereas all the examples I see online start from raw text and extract the tokens from it. Doesn't my df_agg structure mean that the genres are the documents and the keywords are ready-made tokens?

Should I do something different?

Sergey Bushmanov

1 Answer


What you're doing is a bit unconventional, but if you want to proceed this way, go one step back and join each list of tokens into a single string:

```
from sklearn.feature_extraction.text import TfidfVectorizer

# Each row of df_agg is one genre (one "document"); join its keyword list into a string.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df_agg["keywords"].apply(" ".join)).toarray()
```

which you can put into a df, if you wish:

```
import pandas as pd

# Iterating vocabulary_ (a dict) does not follow the matrix's column order,
# so take the labels from get_feature_names_out() (scikit-learn >= 1.0).
df_tfidf = pd.DataFrame(tfidf_matrix, columns=tfidf.get_feature_names_out())
print(df_tfidf)
         k1        k2        k3        k5        k6        k7        k8  \
0  0.359600  0.605014  0.433206  0.433206  0.000000  0.359600  0.000000   
1  0.562638  0.473309  0.677803  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.279457  0.000000  0.400198  0.400198  0.664401  0.400198   
3  0.503968  0.423954  0.000000  0.000000  0.000000  0.000000  0.000000   
4  0.000000  0.000000  0.000000  0.000000  0.609818  0.506204  0.609818   

         k9  
0  0.000000  
1  0.000000  
2  0.000000  
3  0.752515  
4  0.000000
```
Sergey Bushmanov
  • Hey Sergey, thanks for the input. I'm following an online solution, and I figured out that I need to turn the lists of keywords into strings separated by " ". I think the step that turns them into lists serves more as pre-processing. I believe that is why you called it unconventional, but what would the 'conventional' way be? Also, I just re-read your reply and I'm a bit confused about why I should use the exploded df. Isn't one entry per genre with all the keywords the preferable structure, since I want the TF-IDF per genre? – idontknowmuch Oct 26 '20 at 23:52
  • @idontknowmuch You need your data in an array (n_samples,n_features) if you want to move further with any ML algo. Of course you do not need it if you want lists of features only for presentational purposes. – Sergey Bushmanov Oct 27 '20 at 19:27
  • I wanted to create a matrix from the last df with as many rows as there are unique genres, one column per unique keyword, and cell values equal to the number of times each keyword appears in each genre. Is that what you mean? – idontknowmuch Oct 28 '20 at 15:27
  • @idontknowmuch Yes this is what I provided you with – Sergey Bushmanov Oct 28 '20 at 15:57
  • @idontknowmuch Does this answer your question? If so you may think about accepting it. – Sergey Bushmanov Nov 14 '20 at 20:53