I have a dataframe and I'm trying to create a vocabulary of terms from it. I have already tokenized and preprocessed the text down to a list of all words, each with its Doc ID attached. For example, I have:

    Word     Doc ID
0   Big         XX      
1   Big         XZ    
2   Small       XD     
3   Big         XC  
4   Little      XY 
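
For reproducibility, the sample above can be built as a small frame. This is only a minimal sketch: the variable name `df` is an assumption, and the column names are taken from the table.

```python
import pandas as pd

# Sample data from the table above: one row per (word, document) occurrence.
df = pd.DataFrame({
    "Word": ["Big", "Big", "Small", "Big", "Little"],
    "Doc ID": ["XX", "XZ", "XD", "XC", "XY"],
})
print(df)
```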

I want to group identical terms into one row each, add a frequency column, and keep the Doc ID column listing every doc the word appears in, like so:

    Word         Doc ID          Freq
0   Big         XX, XZ, XC         3
1   Small       XD                 1 
2   Little      XY                 1

I have tried grouping by the word and using the .count function to return counts, but this removes the Doc ID. I also can't concat the freq data, as the resultant df wouldn't line up with the values of the first df.

Any help on this please!

2 Answers

There is an easier way to do this using groupby and agg.

import pandas as pd

# Join each word's Doc IDs into one string and count the word's occurrences.
df.groupby("Word") \
  .agg({"Doc ID": ", ".join, "Word": pd.Series.value_counts}) \
  .rename(columns={"Word": "Freq"}) \
  .reset_index()

     Word      Doc ID  Freq
0     Big  XX, XZ, XC     3
1  Little          XY     1
2   Small          XD     1
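
The same result can probably also be reached with pandas named aggregation, which avoids aggregating the grouping column itself. The following is a minimal sketch, not the answer's original code: the variable names `df` and `vocab` are assumptions, the column is assumed to be called "Doc ID" as in the question, and `sort=False` is used only to keep words in order of first appearance, as in the desired output.

```python
import pandas as pd

# Sample frame rebuilt from the question (variable name is an assumption).
df = pd.DataFrame({
    "Word": ["Big", "Big", "Small", "Big", "Little"],
    "Doc ID": ["XX", "XZ", "XD", "XC", "XY"],
})

# Named aggregation: each output column is (input column, function).
# The ** unpacking is needed because "Doc ID" contains a space.
vocab = (
    df.groupby("Word", sort=False)
      .agg(**{"Doc ID": ("Doc ID", ", ".join), "Freq": ("Doc ID", "size")})
      .reset_index()
)
print(vocab)
```

Here "size" counts the rows in each group, so Freq is the number of occurrences of the word, including repeats within the same doc.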
gold_cy

Solved, just as I lost hope:

`df.groupby('Word').agg(lambda x: x.tolist())`

Adding tolist collects all of the Doc IDs for each group into a list.
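
This approach does not produce the frequency column by itself, but the length of each list could serve as one. A minimal sketch, assuming the columns are named "Word" and "Doc ID" as in the question (the names `df` and `vocab` are illustrative):

```python
import pandas as pd

# Sample frame rebuilt from the question (variable name is an assumption).
df = pd.DataFrame({
    "Word": ["Big", "Big", "Small", "Big", "Little"],
    "Doc ID": ["XX", "XZ", "XD", "XC", "XY"],
})

# Collect each word's Doc IDs into a Python list, one row per word.
vocab = df.groupby("Word", sort=False)["Doc ID"].agg(list).reset_index()

# The length of each list is the number of rows the word appeared in.
vocab["Freq"] = vocab["Doc ID"].map(len)
print(vocab)
```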

123 456 789 0