Locating exact term in pandas series of strings

Question

I have a pandas df with a column in which each cell contains a single line of text from Shakespeare plays (100K rows roughly). I need to find exact terms (like 'Rome') while excluding the pattern when it appears inside another word (so not 'Romeo'). I cannot afford to exclude cases like 'Rome.' or 'Rome?'.

I came close with the line below, storing 'Rome' in the variable VAR so I could easily swap in other terms, but it still doesn't quite work.

df[(df['COL'].str.contains(" " + VAR + " ")) | (df['COL'].str.contains(VAR + ";"))].nunique()

score 1 · Answer 1 · edited Feb 27 '19 at 21:30

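You can add regex=False to your contains call to match the term as a literal string rather than as a regular expression. Note that this is still a substring match, so it also finds 'Rome' inside 'Romeo'.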

df[df['COL'].str.contains('Rome', regex=False)]
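
As a quick check of what regex=False does, here is a minimal sketch on a small made-up sample (the `sample` frame and its rows are assumptions for illustration, not data from the question). It suggests that the flag only switches off regex interpretation and does not restrict the match to whole words:

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
sample = pd.DataFrame({'COL': ['Romeo, Romeo', 'To Rome.', 'a.b', 'axb']})

# regex=False treats the pattern as a literal substring,
# so the row containing 'Romeo' is still selected
literal_hits = sample[sample['COL'].str.contains('Rome', regex=False)]['COL'].tolist()

# What the flag does change: '.' is no longer a regex wildcard
dot_regex = sample['COL'].str.contains('a.b').tolist()
dot_literal = sample['COL'].str.contains('a.b', regex=False).tolist()

print(literal_hits)
print(dot_regex, dot_literal)
```

So regex=False is useful when the search term itself contains regex metacharacters, but excluding 'Romeo' needs a word-boundary pattern.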

edited Feb 27 '19 at 21:30

cs95

answered Feb 27 '19 at 21:28

D.Sanders

score 0 · Answer 2 · edited Feb 27 '19 at 21:30

You need to use regex for that:

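import pandas as pd

df = pd.DataFrame({
    'COL': ['aRomeo', 'Rome', 'Rome?', 'Rome.', '!Rome!', 'djkfnjk Rome dsfln']
})
# \b anchors the term at word boundaries: 'aRomeo' is excluded, 'Rome?' and 'Rome.' match
df.loc[df['COL'].str.lower().str.contains(r'\b\W?rome\W?\b')]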
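
To run the same idea with a variable instead of a hard-coded term, one possible sketch is below. The helper name `contains_word` is hypothetical, and the pattern is simplified to `\b` + term + `\b`, which is an assumption on my part rather than the exact regex above; `re.escape` and the `case` and `na` parameters of `str.contains` are standard.

```python
import re
import pandas as pd

def contains_word(series, term):
    """Boolean mask: True where `term` occurs as a whole word (case-insensitive)."""
    # re.escape keeps characters such as '.' or '?' inside the term literal
    pattern = r'\b' + re.escape(term) + r'\b'
    # na=False treats missing values as non-matches
    return series.str.contains(pattern, case=False, regex=True, na=False)

df = pd.DataFrame({
    'COL': ['aRomeo', 'Rome', 'Rome?', 'Rome.', '!Rome!', 'djkfnjk Rome dsfln']
})
VAR = 'Rome'  # swap in any other term here
matches = df.loc[contains_word(df['COL'], VAR)]
print(matches)
```

Note that `\b` only behaves as a word boundary here when the term starts and ends with a letter, digit, or underscore; terms with leading or trailing punctuation would need a different anchor.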

edited Feb 27 '19 at 21:30

cs95

answered Feb 27 '19 at 21:26

Ohad Chaet

Thank you. I had tried '.match' and 'regex = False' as suggested above but they don't seem to work. Your regex solution works perfectly. I don't know how to make it run with a variable but I lost so many hours on this today that having a solution is more than enough for now. Thank you again. – aodhanlutetiae Feb 27 '19 at 21:53
