Retrieving matching word count on a datacolumn using pandas in python

Question

I have a df,

Name      Description
Ram Ram   is one of the good cricketer
Sri Sri   is one of the member
Kumar     Kumar is a keeper

and a list, my_list=["one","good","ravi","ball"]

I am trying to get the rows which are having atleast one keyword from my_list.

I tried,

  mask=df["Description"].str.contains("|".join(my_list),na=False)

I am getting the output_df,

Name    Description
Ram     Ram is one of ONe crickete
Sri     Sri is one of the member
Ravi    Ravi is a player, ravi is playing
Kumar   there is a BALL

I also want to add the keywords present in the "Description" and its counts in a separate columns,

My desired output is,

Name    Description                      pre-keys          keys     count
Ram     Ram is one of ONe crickete         one,good,ONe   one,good    2
Sri     Sri is one of the member           one            one         1
Ravi    Ravi is a player, ravi is playing  Ravi,ravi      ravi        1
Kumar   there is a BALL                    ball           ball        1

I've rolled back your edit for now. Go into your edit history, copy your old edit, and paste inside a new question. Thank you! :-) — cs95, Oct 25 '17 at 10:40
As suggested, here is the new question, please check https://stackoverflow.com/questions/46930681/mapping-matching-word-count-on-a-column-using-pandas-in-python — Pyd, Oct 25 '17 at 11:01

jezrael · Accepted Answer · 2017-10-25T10:29:29.870

5

Use str.findall + str.join + str.len:

extracted = df['Description'].str.findall('(' + '|'.join(my_list) + ')') 
df['keys'] = extracted.str.join(',')
df['count'] = extracted.str.len()
print (df)
  Name                       Description      keys  count
0  Ram  Ram is one of the good cricketer  one,good      2
1  Sri          Sri is one of the member       one      1

EDIT:

import re
my_list=["ONE","good"]

extracted = df['Description'].str.findall('(' + '|'.join(my_list) + ')', flags=re.IGNORECASE)
df['keys'] = extracted.str.join(',')
df['count'] = extracted.str.len()
print (df)
  Name                       Description      keys  count
0  Ram  Ram is one of the good cricketer  one,good      2
1  Sri          Sri is one of the member       one      1

edited Oct 25 '17 at 10:29

answered Oct 25 '17 at 07:34

jezrael

822,522
95
1,334
1,252

1

I used `extractall`. I like `findall` much better. – piRSquared Oct 25 '17 at 07:35
1

Thanks for the solution, @Jezrael – Pyd Oct 25 '17 at 07:38
@jezrael, how can I apply ignore case, as if my_list contains ["ONE"] it is not matching – Pyd Oct 25 '17 at 10:10
I applied ignore case, now I edited the question, added one more column – Pyd Oct 25 '17 at 10:37
give me some time, boss need something:( – jezrael Oct 25 '17 at 10:45
Ok, so need it or not? I see your comemnts under question. – jezrael Oct 25 '17 at 10:53
I explained it clearly in my new question as coldspeed suggest, please check https://stackoverflow.com/questions/46930681/mapping-matching-word-count-on-a-column-using-pandas-in-python – Pyd Oct 25 '17 at 11:02
1

@pyd - I see your new [question](https://stackoverflow.com/q/47096797/2901002). Can you explain why my edited answer does not work? If want exact match, use first solution, if need ignore coase, use edited answer. Can you explain more? Or there are some data where my solution failed? If yes, can you add this problematic data? thanks. – jezrael Nov 04 '17 at 10:23
Hi @Jezrael, currently I'm using your edited solution but in this, `if df['Description'] has 'ravi is good and Ravi is playing ' and my_list =['ravi'] then it adds ['ravi','Ravi'] I don't want this . It should be df['keys'] ='ravi'(we need to usewhat case presented in my_list) and df['count'] =1` please let me know if you need more explanation – Pyd Nov 04 '17 at 11:11
Can you edit question with data sample where it is problematic? 3-4 rows are super ;) thanks. – jezrael Nov 04 '17 at 11:12
Hmmm, but there first solution in this answer works very nice. – jezrael Nov 04 '17 at 11:33
I updated with two keys column first pre-keys should be calculated, keys column should be calculated from pre-keys and not directly from description – Pyd Nov 04 '17 at 11:38
I understand, now, give me some time. – jezrael Nov 04 '17 at 11:40
I added one more row, first solution is not applying ignore case so It will not map the last row, please check – Pyd Nov 04 '17 at 11:42
did you try, @jezrael – Pyd Nov 04 '17 at 12:16
Yes, but I get `BALL` in last row as output :( So I am thinking what do for output what you need... – jezrael Nov 04 '17 at 12:19
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/158214/discussion-between-pyd-and-jezrael). – Pyd Nov 04 '17 at 12:22

cs95 · Answer 2 · 2017-10-25T07:46:57.303

1

Took a shot at this with str.findall.

c = df.Description.str.findall('({})'.format('|'.join(my_list)))
df['keys'] = c.apply(','.join) # or c.str.join(',')
df['count'] = c.str.len()

df[df['count'] > 0]

  Name                       Description      keys  count
0  Ram  Ram is one of the good cricketer  one,good      2
1  Sri          Sri is one of the member       one      1

edited Oct 25 '17 at 07:46

answered Oct 25 '17 at 07:36

cs95

379,657
97
704
746

Thanks for the solution, @COLDSPEED – Pyd Oct 25 '17 at 07:39
@pyd Thanks for asking B) – cs95 Oct 25 '17 at 07:39
@COLDSPEED, how can I apply ignore case, as if my_list contains ["ONE"] it is not matching – Pyd Oct 25 '17 at 10:12
@pyd You accepted jezrael's solution though? If you're using his code, you should ask him. – cs95 Oct 25 '17 at 10:14
I found the solution, I have one more doubt as I need that to be handle in my handle, can you help? – Pyd Oct 25 '17 at 10:31
@pyd If it is related to this question, then alright. – cs95 Oct 25 '17 at 10:33
@pyd Honestly, that question requires an involved answer. It isn't as simple as an ignore case flag... please open a new question. Thanks a bunch. – cs95 Oct 25 '17 at 10:39

Retrieving matching word count on a datacolumn using pandas in python

2 Answers2

Linked