1

I have a df,

Name    Step     Description
Ram        1     Ram is oNe of the good cricketer
Ram        2     gopal one
Sri        1     Sri is one of the member
Sri        2     ravi good 
Kumar      1     Kumar is a keeper
Madhu      1     good boy
Vignesh    1     oNe little
Pechi      1     one book
mario      1     good randokm
Roger      1     one milita good
bala       1     looks good
raj        1     more one
venk       1     likes good

and a list,

my_list=["one","good"]

I am trying to get the rows which are having atleast one keyword from my_list.

I tried, mask=df["Description"].str.contains("|".join(my_list),na=False) I am getting the output_df,

Name    Description
Ram     Ram is one of the good cricketer
Sri     Sri is one of the member        

I also want to add the keywords present in the "Description" and its counts in a separate columns,

Even the "Description" contains a keyword when the df["Name"] is not a first time occureance it should not copy the keyword in keys column My desired output is,

my_desired output is,

 Name   Step    Description                          keys        count
 Ram     1     Ram is one of the good cricketer      one,good    2
 Ram     2     gopal one
 Sri     1     Sri is one of the member              one         1
 Sri     2     ravi good
 Kumar   1     Kumar is a keeper
 Madhu   1     good boy                              good        1
 Vignesh 1     oNe little                            oNe         1
 Pechi   1     one book                              one         1 
 mario   1     good randokm good                     good        1
 Roger   1     one milita good                       one,good    2
 bala    1     looks good                            good        1
 raj     1     more one                              one         1
 venk    1     likes good                            good        1
Pyd
  • 6,017
  • 18
  • 52
  • 109

1 Answers1

1

Create new mask and apply it:

my_list=["one","good"]

mask=df["Description"].str.contains("|".join(my_list),na=False,flags=re.IGNORECASE ) & \
     (df.groupby('Name').cumcount() == 0)
print (mask)
0      True
1     False
2      True
3     False
4     False
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
dtype: bool

extracted = df['Description'].str.findall('(' + '|'.join(my_list) + ')', flags=re.IGNORECASE)
df.loc[mask, 'keys'] = extracted.str.join(',')
df.loc[mask, 'count'] = extracted.str.len()
print (df)
       Name  Step                       Description      keys  count
0       Ram     1  Ram is oNe of the good cricketer  oNe,good    2.0
1       Ram     2                         gopal one       NaN    NaN
2       Sri     1          Sri is one of the member       one    1.0
3       Sri     2                        ravi good        NaN    NaN
4     Kumar     1                 Kumar is a keeper       NaN    NaN
5     Madhu     1                          good boy      good    1.0
6   Vignesh     1                        oNe little       oNe    1.0
7     Pechi     1                          one book       one    1.0
8     mario     1                      good randokm      good    1.0
9     Roger     1                   one milita good  one,good    2.0
10     bala     1                        looks good      good    1.0
11      raj     1                          more one       one    1.0
12     venk     1                        likes good      good    1.0

EDIT:

#transform all values if need same size of original
s = df.groupby('Name')['Description'].transform(','.join)
print (s)
0     Ram is oNe of the good cricketer,gopal one
1     Ram is oNe of the good cricketer,gopal one
2            Sri is one of the member,ravi good 
3            Sri is one of the member,ravi good 
4                              Kumar is a keeper
5                                       good boy
6                                     oNe little
7                                       one book
8                              good randokm good
9                                one milita good
10                                    looks good
11                                      more one
12                                    likes good
Name: Description, dtype: object

#for mask use new Series s
mask=s.str.contains("|".join(my_list),na=False,flags=re.IGNORECASE ) & \
     (df.groupby('Name').cumcount() == 0)
print (mask)
0      True
1     False
2      True
3     False
4     False
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
dtype: bool

#extract from new Series s
extracted = s.str.findall('(' + '|'.join(my_list) + ')', flags=re.IGNORECASE).apply(set)
df.loc[mask, 'keys'] = extracted.str.join(',')
df.loc[mask, 'count'] = extracted.str.len()
print (df)
       Name  Step                       Description          keys  count
0       Ram     1  Ram is oNe of the good cricketer  good,oNe,one    3.0
1       Ram     2                         gopal one           NaN    NaN
2       Sri     1          Sri is one of the member      good,one    2.0
3       Sri     2                        ravi good            NaN    NaN
4     Kumar     1                 Kumar is a keeper           NaN    NaN
5     Madhu     1                          good boy          good    1.0
6   Vignesh     1                        oNe little           oNe    1.0
7     Pechi     1                          one book           one    1.0
8     mario     1                 good randokm good          good    1.0
9     Roger     1                   one milita good      good,one    2.0
10     bala     1                        looks good          good    1.0
11      raj     1                          more one           one    1.0
12     venk     1                        likes good          good    1.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I dont want to consider Step column, I want to apply the logic on "Name" column. When the name value occurence is first time. as you see in index=1, Ram occured 2nd time so we should not consider the keywords on the row with index 1 – Pyd Oct 25 '17 at 11:08
  • Ok, do you think `cumcunt` for count accurences? – jezrael Oct 25 '17 at 11:10
  • And match only first values? `print (df.groupby('Name').cumcount())` Equal to `0` ? – jezrael Oct 25 '17 at 11:11
  • ` 0 0 1 1 2 0 3 1 4 0 ` – Pyd Oct 25 '17 at 11:17
  • and for data? can you explain more? – jezrael Oct 25 '17 at 11:17
  • Ok, I think I need more data. I have idea - can you create 10 values of `Name` column and get another column which values are True ? Thanks. – jezrael Oct 25 '17 at 11:19
  • for index 8 th row, i gave "Description" as good randokm good, so in keys ill get good,good . I want to remove duplicate in this column if a cell in keys contains good,good it should replaced with good – Pyd Oct 25 '17 at 12:27
  • Need `apply(set)` like `extracted = df['Description'].str.findall('(' + '|'.join(my_list) + ')', flags=re.IGNORECASE).apply(set)` – jezrael Oct 25 '17 at 12:32
  • Thank you @jezrael. is it possible first putting the values of df[keys] when df.groupby('Name').cumcount() != 0 , like in my above question . In index 1 row we are not considering the keyword one from gopal one. is it possble to append this "one" to the first occurence of Ram (i.e) in index 0 df[ key] – Pyd Oct 25 '17 at 12:41
  • So first row is not changed (because add another `one` and set remove it) and for `sri 1` get `one, good 2` ? – jezrael Oct 25 '17 at 12:46
  • Not mean to bother,but I want to know how you get the `df` I try `df = pd.read_csv(io.StringIO(t),delimiter=' ') `with no correct output,`t` is the copy of the data from OP @jezrael – ileadall42 Oct 29 '17 at 05:10
  • @Tangfeifan - Try `df = pd.read_csv(io.StringIO(t),sep='\s{2,}', engine='python')` – jezrael Oct 29 '17 at 06:07
  • It works,thanks!@jezrael Actually i have try sep- -,but at that time it didn't work! – ileadall42 Oct 30 '17 at 02:13
  • I am getting `TypeError: sequence item 2: expected str instance, float found`in `s = df.groupby('Name')['Description'].transform(','.join)` – Pyd Oct 30 '17 at 12:51
  • Use `s = df.groupby('Name')['Description'].transform(lambda x: ','.join(x.astype(str)))` – jezrael Oct 30 '17 at 12:55
  • 1
    ok but I did, `fillna("")` then `s= df.groupby('Name')['Description'].transform(','.join)` it, works. Do I need to change it to `s = df.groupby('Name')['Description'].transform(lambda x: ','.join(x.astype(str)))` – Pyd Oct 30 '17 at 13:24
  • Yes, but it can be also error some numeric value, so `astype(str)`. Or if use `s= df.fillna('').groupby('Name')['Description'].transform(','.join)` it also failed? – jezrael Oct 30 '17 at 13:26