Pandas DataFrame give frequency of Column occurrence whilst maintaining the Doc ID

Question

I have a DataFrame and I'm trying to build a vocabulary of terms from it. I have already tokenized and preprocessed the text down to a list of all words, each with its Doc ID attached. For example, I have:

    Word     Doc ID
0   Big         XX      
1   Big         XZ    
2   Small       XD     
3   Big         XC  
4   Little      XY

I want to group identical terms, add a frequency column, and keep a Doc ID column listing every document the word appears in, like so:

    Word         Doc ID          Freq
0   Big         XX, XZ, XC         3
1   Small       XD                 1 
2   Little      XY                 1

I have tried grouping by the word and using .count() to return counts, but this removes the Doc ID. I also can't concat the frequency data, because the resulting df wouldn't line up with the values of the first df.

Any help on this please!

score 0 · Answer 1 · answered Jan 23 '22 at 13:35

There is an easier way to do this using groupby and agg.

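# Join each word's IDs and count its rows in one pass (named aggregation).
# Assumes the ID column is called "DocID", as in the output below.
df.groupby("Word") \
  .agg(DocID=("DocID", ", ".join), Freq=("DocID", "size")) \
  .reset_index()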

     Word       DocID  Freq
0     Big  XX, XZ, XC     3
1  Little          XY     1
2   Small          XD     1
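
For anyone who wants to run this end to end, here is a minimal self-contained sketch built on the sample rows from the question. It assumes the column really is named "Doc ID" (with a space), which is why the named aggregation is passed as an unpacked dict. The name `vocab` and `sort=False` (used to keep words in order of first appearance, as in the desired output) are my own choices, not part of the answer above.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "Word": ["Big", "Big", "Small", "Big", "Little"],
        "Doc ID": ["XX", "XZ", "XD", "XC", "XY"],
    }
)

# Named aggregation: each output column maps to (input column, aggregator).
# The dict is unpacked because "Doc ID" is not a valid keyword name.
vocab = (
    df.groupby("Word", sort=False)
    .agg(**{"Doc ID": ("Doc ID", ", ".join), "Freq": ("Doc ID", "size")})
    .reset_index()
)

print(vocab)
```

Drop `sort=False` if you prefer the words sorted alphabetically, which is the groupby default.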

score -1 · Answer 2 · edited Jan 24 '22 at 04:34

Solved, just as I lost hope:

df.groupby('Word').agg(lambda x: x.tolist())

I added tolist, which collects each group's Doc IDs into a list.

answered Jan 23 '22 at 13:33

Luke Delves

this doesn't provide the desired output that you asked for – gold_cy Jan 23 '22 at 13:37
df['Frequency'] = df['Doc ID'].str.len() <- adding that does – Luke Delves Jan 24 '22 at 14:03
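
Putting this answer and the follow-up comment together, a rough sketch might look like the one below. It uses the sample rows from the question; the name `grouped` and the choice to select only the "Doc ID" column before aggregating are my assumptions, not something stated in the answer.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "Word": ["Big", "Big", "Small", "Big", "Little"],
        "Doc ID": ["XX", "XZ", "XD", "XC", "XY"],
    }
)

# Collect each word's Doc IDs into a Python list, one row per word.
grouped = df.groupby("Word", sort=False)["Doc ID"].agg(list).reset_index()

# .str.len() also works on list-valued cells, so it gives the number of
# Doc IDs per word, i.e. the frequency.
grouped["Frequency"] = grouped["Doc ID"].str.len()

print(grouped)
```

Note that this leaves "Doc ID" as a list per row rather than the comma-separated string shown in the question; `grouped["Doc ID"].str.join(", ")` would convert it if the string form is needed.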
