I want to calculate the TF-IDF of keywords for a given genre. These keywords were never part of a running text; they were already separated, but stored in a different format. I extracted them from that format and put them into lists, and did the same with the genres.
I had a df in this format:
```
keywords,genres
['k1','k2','k3'],['g1','g2']
['k2','k5','k7'],['g1','g3']
['k1','k2','k9'],['g4']
['k6','k7','k8'],['g3','g5']
...
```
I used explode on the genres col and got:
```
['k1','k2','k3'],g1
['k1','k2','k3'],g2
['k2','k5','k7'],g1
['k2','k5','k7'],g3
['k1','k2','k9'],g4
['k6','k7','k8'],g3
['k6','k7','k8'],g5
...
```
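For reference, the explode step looks roughly like this. It is a minimal sketch: the sample frame is rebuilt from the rows shown above, and the names `df` and `df_exploded` are only for illustration.

```python
import pandas as pd

# Sample rows reconstructed from the table above; the real data has more rows.
df = pd.DataFrame({
    "keywords": [["k1", "k2", "k3"], ["k2", "k5", "k7"],
                 ["k1", "k2", "k9"], ["k6", "k7", "k8"]],
    "genres": [["g1", "g2"], ["g1", "g3"], ["g4"], ["g3", "g5"]],
})

# One output row per genre; the keyword list is repeated for each genre.
df_exploded = df.explode("genres", ignore_index=True)
print(df_exploded)
```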
Then I grouped by genre to get this df_agg:
```
genres,keywords
g1,['k1','k2','k3','k2','k5','k7']
g2,['k1','k2','k3']
g3,['k2','k5','k7','k6','k7','k8']
g4,['k1','k2','k9']
g5,['k6','k7','k8']
...
```
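The grouping step can be sketched as follows. This is a rough reconstruction rather than my exact code: it starts from the exploded sample rows and concatenates the keyword lists of each genre, keeping duplicates.

```python
import pandas as pd

# Exploded sample rows, rebuilt from the table above.
df_exploded = pd.DataFrame({
    "keywords": [["k1", "k2", "k3"], ["k1", "k2", "k3"],
                 ["k2", "k5", "k7"], ["k2", "k5", "k7"],
                 ["k1", "k2", "k9"],
                 ["k6", "k7", "k8"], ["k6", "k7", "k8"]],
    "genres": ["g1", "g2", "g1", "g3", "g4", "g3", "g5"],
})

# Flatten all keyword lists that belong to the same genre into one list.
# Duplicates are kept on purpose, since they carry the term frequency.
df_agg = (
    df_exploded.groupby("genres")["keywords"]
    .apply(lambda lists: [kw for lst in lists for kw in lst])
    .reset_index()
)
print(df_agg)
```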
I made these changes so that I can calculate the TF-IDF of the keywords per genre, but I'm not sure whether this is the correct format: df_agg['keywords'] holds lists, whereas all the examples I see online start from raw text and extract the tokens from it. Doesn't my df_agg structure mean that the genres are the documents and the keywords are ready-made tokens?
Should I do something different?