I have a pandas DataFrame with a column in which each cell holds a single line of text from Shakespeare's plays (roughly 100K rows). I need to find exact terms (like 'Rome') while excluding matches where the term appears inside another word (so not 'Romeo'). Cases like 'Rome.' or 'Rome?' must still be matched.

I came close with the line below, storing 'Rome' in a variable (VAR) so I could easily swap in other terms, but it still doesn't quite work.

df[(df['COL'].str.contains(" " + VAR + " ")) | (df['COL'].str.contains(VAR + ";"))].nunique() 
d_kennetz

2 Answers


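You can pass regex=False to str.contains so that the pattern is treated as a literal string rather than a regular expression. Note that this is still a substring match, so 'Rome' will also match 'Romeo'.

df[df['COL'].str.contains('Rome', regex=False)]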
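
A quick way to check how regex=False behaves is to run it on a few rows and compare it with a word-boundary pattern. This is only a sketch: the Series `s` and its contents are made-up sample data, not the asker's DataFrame.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
s = pd.Series(['Romeo and Juliet', 'Rome.', 'to Rome?', 'roam'])

# regex=False: 'Rome' is matched as a literal substring, so 'Romeo' matches too.
literal = s.str.contains('Rome', regex=False)

# A word-boundary regex excludes 'Romeo' but keeps 'Rome.' and 'Rome?'.
bounded = s.str.contains(r'\bRome\b')

print(literal.tolist())  # [True, True, True, False]
print(bounded.tolist())  # [False, True, True, False]
```

The first row shows the difference: the literal search accepts 'Romeo', while the `\b` version does not.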
cs95
D.Sanders

You need to use regex for that:

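import pandas as pd

df = pd.DataFrame({
    'COL': ['aRomeo', 'Rome', 'Rome?', 'Rome.', '!Rome!', 'djkfnjk Rome dsfln']
})
# \b is a word boundary: 'aRomeo' is rejected, while 'Rome?' and 'Rome.' still match.
df.loc[df['COL'].str.lower().str.contains(r'\b\W?rome\W?\b')]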
cs95
Ohad Chaet
  • Thank you. I had tried '.match' and 'regex = False' as suggested above but they don't seem to work. Your regex solution works perfectly. I don't know how to make it run with a variable but I lost so many hours on this today that having a solution is more than enough for now. Thank you again. – aodhanlutetiae Feb 27 '19 at 21:53
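
On running it with a variable: one possible approach is to build the pattern string from the variable before passing it to str.contains. The sketch below assumes the search term is stored in `VAR`, as in the question, and reuses the sample data from the answer. `re.escape` and `case=False` are additions of this sketch, not part of the original answer.

```python
import re
import pandas as pd

# Sample data from the answer above, standing in for the real column.
df = pd.DataFrame({
    'COL': ['aRomeo', 'Rome', 'Rome?', 'Rome.', '!Rome!', 'djkfnjk Rome dsfln']
})

VAR = 'Rome'  # the search term, named as in the question

# re.escape guards against terms that contain regex metacharacters.
pattern = r'\b' + re.escape(VAR) + r'\b'

# case=False ignores case, similar to lowercasing the column first.
matches = df[df['COL'].str.contains(pattern, case=False, regex=True)]
print(matches['COL'].tolist())
# ['Rome', 'Rome?', 'Rome.', '!Rome!', 'djkfnjk Rome dsfln']
```

Changing `VAR` to another term reuses the same filter. Note that `\b` only behaves as expected when the term starts and ends with a letter, digit, or underscore.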