1

Hi I have two DataFrames like below

 DF1

 Alpha   |  Numeric  |  Special

 and     |   1        |  @
 or      |   2       |  #
 lol ok  |   4       |  &






DF2 with single column

Content          

boy or girl  
school @ morn
pyc LoL ok student
Chandra

I want to search if anyone of the column in DF1 has anyone of the keyword in content column of DF2 and the output should be in a new DF

 `df11 = (df1.unstack()
      .reset_index(level=2,drop=True)
      .rename_axis(('col_order','col_name'))
      .dropna()
      .reset_index(name='val_low'))

 df22 = (df2['Content'].str.split(expand=True)
                 .stack()
                 .rename('val')
                 .reset_index(level=1,drop=True)
                 .rename_axis('idx')
                 .reset_index())`

 df22['val_low'] = df22['val'].str.lower()                    

 df = (pd.merge(df22, df11, on='val_low', how='left')
   .dropna(subset=['col_name'])
   .sort_values(['idx','col_order'])
   .drop_duplicates(['idx']))


 df = (pd.concat([df2, df.set_index('idx')], axis=1)
   .fillna({'col_name':'Other'})[['val','col_name','Content']])

but it is not considering the spaces between lol ok

 expected_output_DF

     val      col_name          Content
 0   or       Alpha             boy or girl
 1    @      Special            school @ morn
 2   lol ok  Alpha              pyc LoL ok student
 3  NaN      Other              Chandra

someone help me with this

Pyd
  • 6,017
  • 18
  • 52
  • 109

2 Answers2

2

Melt the dataframe 1 and convert it to dict. Then get the values of dict based on key obtained by pattern match in dataframe 2 i.e

vals = df.melt()
di = dict(zip(vals['value'],vals['variable']))
# {'or': 'Alpha', 1: 'Numeric', 2: 'Numeric', 'and': 'Alpha', 4: 'Numeric', '@': 'Special', '#': 'Special', '&': 'Special', 'Special': 'new', 'Alpha': 'new', 'lol ok': 'Alpha'}

#Create a regex pattern based on dict keys. 
pat = '|'.join(r"\s{}\s".format(x) for x in di.keys())

#Find the words that match the pattern 
df2['val'] = df2['Content'].str.lower().str.findall(pat).apply(lambda x : x[0].strip() if len(x)>=1 else np.nan)

# Map the values with di and fill nan with other. 
df2['new'] = df2['val'].map(di).fillna('other')

Ouptut :


             Content      new     val
0         boy or girl    Alpha      or
1       school @ morn  Special       @
2  pyc LoL ok student    Alpha  lol ok
3             Chandra    other     NaN
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
  • I cannot melt my dataframe, I think the syntax should be `pd.melt(df)`, can you confirm and edit the same in the solution – Pyd Nov 06 '17 at 09:08
  • I am getting this error, `TypeError: object of type 'float' has no len()` on `df2['val'] = df2[content_col].str.lower().str.findall(pat).apply(lambda x : x[0].strip() if len(x)>=1 else np.nan)` – Pyd Nov 06 '17 at 10:45
  • Which pandas version are you using? – Bharath M Shetty Nov 06 '17 at 10:50
  • pandas version is '0.19.2' , I did filledna() but it affects in other ways – Pyd Nov 06 '17 at 10:52
  • Update ur pandas version and try again – Bharath M Shetty Nov 06 '17 at 11:10
  • @Bharath , can you check this one https://stackoverflow.com/questions/47386216/matching-values-between-two-dataframes-with-a-condition-in-pandas – Pyd Nov 20 '17 at 06:45
1

Use str.cat + str.extract. Then, use map for the column names, and pd.concat to join.

i = df.stack().astype(str)
j = i.reset_index(level=0, drop=1)

m = dict(zip(j.values, j.index))
v = i.str.cat(sep='|')  

df2['val'] = df2.Content.str.extract(r'\s(' + v + r')\s', flags=re.I, expand=False)
df2['col_name'] = df2['val'].str.lower().map(m).fillna('Other')

df2

              Content     val col_name
0         boy or girl      or    Alpha
1       school @ morn       @  Special
2  pyc LoL ok student  LoL ok    Alpha
3             Chandra     NaN    Other

Details

  • i and j are setup variable to create the mapping
  • m is the mapping of values to column names
  • v is the regex pattern that is sent to str.extract for keyword extraction. I use re.I to ignore the case
  • I use map + fillna to map extracted values to column names via m
v
'and|1|@|or|2|#|lol ok|4|&'

m
{'#': 'Special',
 '&': 'Special',
 '1': 'Numeric',
 '2': 'Numeric',
 '4': 'Numeric',
 '@': 'Special',
 'and': 'Alpha',
 'lol ok': 'Alpha',
 'or': 'Alpha'}


df['val']

0        or
1         @
2    LoL ok
3       NaN
Name: val, dtype: object

df['col_name']

0      Alpha
1    Special
2      Alpha
3      Other
Name: col_name, dtype: object
cs95
  • 379,657
  • 97
  • 704
  • 746
  • @CHANDRAMURUGAN Check the top of my answer, it's there. – cs95 Nov 06 '17 at 07:22
  • 1
    @cᴏʟᴅsᴘᴇᴇᴅ so many new terms in the code. You need to update it. I think its `df2['val'] =` and also not jus `val.str` maybe `df2['val'l.str`. – Bharath M Shetty Nov 06 '17 at 07:36
  • @Bharath Oh man, there were a ton of those bugs, but I've fixed them. – cs95 Nov 06 '17 at 07:41
  • Cheers, me too :) – cs95 Nov 06 '17 at 07:42
  • @coldspped, tried your solution its is not working when the mapping keyword present in the end like if the keyword in df is "displayed" and in `df2["content"]="Policies not getting renewed displayed"` its going under other category – Pyd Nov 06 '17 at 10:55
  • @CHANDRAMURUGAN Change `\s` to `\b` in the pattern and see if it works. – cs95 Nov 06 '17 at 11:14