3

I have a df,

Name      Description
Ram Ram   is one of the good cricketer
Sri Sri   is one of the member
Kumar     Kumar is a keeper

and a list, my_list=["one","good","ravi","ball"]

I am trying to get the rows which are having atleast one keyword from my_list.

I tried,

  mask=df["Description"].str.contains("|".join(my_list),na=False)

I am getting the output_df,

Name    Description
Ram     Ram is one of ONe crickete
Sri     Sri is one of the member
Ravi    Ravi is a player, ravi is playing
Kumar   there is a BALL

I also want to add the keywords present in the "Description" and its counts in a separate columns,

My desired output is,

Name    Description                      pre-keys          keys     count
Ram     Ram is one of ONe crickete         one,good,ONe   one,good    2
Sri     Sri is one of the member           one            one         1
Ravi    Ravi is a player, ravi is playing  Ravi,ravi      ravi        1
Kumar   there is a BALL                    ball           ball        1
Pyd
  • 6,017
  • 18
  • 52
  • 109
  • I've rolled back your edit for now. Go into your edit history, copy your old edit, and paste inside a new question. Thank you! :-) – cs95 Oct 25 '17 at 10:40
  • where to get my edit history ? – Pyd Oct 25 '17 at 10:47
  • https://stackoverflow.com/posts/46926464/revisions – cs95 Oct 25 '17 at 10:47
  • As suggested, here is the new question, please check https://stackoverflow.com/questions/46930681/mapping-matching-word-count-on-a-column-using-pandas-in-python – Pyd Oct 25 '17 at 11:01

2 Answers2

5

Use str.findall + str.join + str.len:

extracted = df['Description'].str.findall('(' + '|'.join(my_list) + ')') 
df['keys'] = extracted.str.join(',')
df['count'] = extracted.str.len()
print (df)
  Name                       Description      keys  count
0  Ram  Ram is one of the good cricketer  one,good      2
1  Sri          Sri is one of the member       one      1

EDIT:

import re
my_list=["ONE","good"]

extracted = df['Description'].str.findall('(' + '|'.join(my_list) + ')', flags=re.IGNORECASE)
df['keys'] = extracted.str.join(',')
df['count'] = extracted.str.len()
print (df)
  Name                       Description      keys  count
0  Ram  Ram is one of the good cricketer  one,good      2
1  Sri          Sri is one of the member       one      1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    I used `extractall`. I like `findall` much better. – piRSquared Oct 25 '17 at 07:35
  • 1
    Thanks for the solution, @Jezrael – Pyd Oct 25 '17 at 07:38
  • @jezrael, how can I apply ignore case, as if my_list contains ["ONE"] it is not matching – Pyd Oct 25 '17 at 10:10
  • I applied ignore case, now I edited the question, added one more column – Pyd Oct 25 '17 at 10:37
  • give me some time, boss need something:( – jezrael Oct 25 '17 at 10:45
  • Ok, so need it or not? I see your comemnts under question. – jezrael Oct 25 '17 at 10:53
  • I explained it clearly in my new question as coldspeed suggest, please check https://stackoverflow.com/questions/46930681/mapping-matching-word-count-on-a-column-using-pandas-in-python – Pyd Oct 25 '17 at 11:02
  • 1
    @pyd - I see your new [question](https://stackoverflow.com/q/47096797/2901002). Can you explain why my edited answer does not work? If want exact match, use first solution, if need ignore coase, use edited answer. Can you explain more? Or there are some data where my solution failed? If yes, can you add this problematic data? thanks. – jezrael Nov 04 '17 at 10:23
  • Hi @Jezrael, currently I'm using your edited solution but in this, `if df['Description'] has 'ravi is good and Ravi is playing ' and my_list =['ravi'] then it adds ['ravi','Ravi'] I don't want this . It should be df['keys'] ='ravi'(we need to usewhat case presented in my_list) and df['count'] =1` please let me know if you need more explanation – Pyd Nov 04 '17 at 11:11
  • Can you edit question with data sample where it is problematic? 3-4 rows are super ;) thanks. – jezrael Nov 04 '17 at 11:12
  • Hmmm, but there first solution in this answer works very nice. – jezrael Nov 04 '17 at 11:33
  • I updated with two keys column first pre-keys should be calculated, keys column should be calculated from pre-keys and not directly from description – Pyd Nov 04 '17 at 11:38
  • I understand, now, give me some time. – jezrael Nov 04 '17 at 11:40
  • I added one more row, first solution is not applying ignore case so It will not map the last row, please check – Pyd Nov 04 '17 at 11:42
  • did you try, @jezrael – Pyd Nov 04 '17 at 12:16
  • Yes, but I get `BALL` in last row as output :( So I am thinking what do for output what you need... – jezrael Nov 04 '17 at 12:19
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/158214/discussion-between-pyd-and-jezrael). – Pyd Nov 04 '17 at 12:22
1

Took a shot at this with str.findall.

c = df.Description.str.findall('({})'.format('|'.join(my_list)))
df['keys'] = c.apply(','.join) # or c.str.join(',')
df['count'] = c.str.len()

df[df['count'] > 0]

  Name                       Description      keys  count
0  Ram  Ram is one of the good cricketer  one,good      2
1  Sri          Sri is one of the member       one      1
cs95
  • 379,657
  • 97
  • 704
  • 746