6

My question is kind of an extension of the question answered quite well in this link:

I've copied that answer below; it keeps only the rows whose string contains the word "ball":

In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
     ids  vals
0  aball     1
1  bball     2
3  fball     4

Now my question is: what if my data contains long sentences and I want to find the strings that contain both "ball" AND "field"? Rows where only one of the two words occurs should be thrown away, and only rows whose string contains both words should be kept.

ekhumoro
Mars
  • 2
    BTW, if searching for fixed strings (i.e. not regex), you can often use `df['ids'].str.contains("ball", regex=False)` for a bit of a speed boost. – Alex Riley Nov 05 '17 at 18:58
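As a rough sketch of how that comment combines with the "both words" requirement: the sample frame below is made up for illustration and is not from the original question. Each word is matched as a literal substring, and the two masks are then combined with `&`.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"ids": ["ball on the field", "ball only", "field only"]})

# regex=False treats the pattern as a plain substring, not a regular expression.
has_ball = df["ids"].str.contains("ball", regex=False)
has_field = df["ids"].str.contains("field", regex=False)

# Keep only the rows where both masks are True.
both = df[has_ball & has_field]
print(both)
```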

4 Answers

5
df[df['ids'].str.contains("ball")]

Would become:

df[df['ids'].str.contains("ball") & df['ids'].str.contains("field")]

If you are into neater code:

contains_balls = df['ids'].str.contains("ball")
contains_fields = df['ids'].str.contains("field")

filtered_df = df[contains_balls & contains_fields]
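A self-contained version of the same idea may help when the column has missing values. This is only a sketch: the sample rows are invented, and passing `na=False` is an added assumption (it makes missing entries count as "no match" so the mask stays purely boolean).

```python
import pandas as pd

# Hypothetical sample data, including one missing value.
df = pd.DataFrame(
    {"ids": ["ball and field", "just a ball", None, "field, then ball"]}
)

# na=False maps missing entries to False instead of NaN,
# so the combined mask can be used for boolean indexing.
contains_ball = df["ids"].str.contains("ball", na=False)
contains_field = df["ids"].str.contains("field", na=False)

filtered_df = df[contains_ball & contains_field]
print(filtered_df)
```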
foxyblue
2

If you have more than two words, you can use the following (note that it is slower than foxyblue's method):

words = ['ball', 'field']
# Boolean mask: True where the string contains every word in the list
mask = df.ids.apply(lambda x: all(w in x for w in words))
df[mask]
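The row-wise check above assumes every cell is a string; `w in x` raises a `TypeError` on a missing value. A minimal sketch that guards against that, using a hypothetical helper name `has_all_words` and made-up sample data:

```python
import pandas as pd

words = ["ball", "field", "goal"]

# Hypothetical sample data with one missing value.
df = pd.DataFrame({"ids": ["ball, field and goal", "ball and field", None]})

def has_all_words(text):
    # Non-string cells (None/NaN) are treated as not matching.
    return isinstance(text, str) and all(w in text for w in words)

mask = df["ids"].apply(has_all_words)
result = df[mask]
print(result)
```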
foxyblue
BENY
0

You could build one str.contains mask per word and combine them with np.logical_and.reduce; this works for any number of words.

import numpy as np

df[np.logical_and.reduce([df['ids'].str.contains(w) for w in ['ball', 'field']])]

In [96]: df
Out[96]:
             ids
0  ball is field
1     ball is wa
2  doll is field

In [97]: df[np.logical_and.reduce([df['ids'].str.contains(w) for w in ['ball', 'field']])]
Out[97]:
             ids
0  ball is field
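If the reduction is needed in several places, it could be wrapped in a small function. The sketch below uses a hypothetical helper name `contains_all`; the `regex=False` and `na=False` arguments are added assumptions (literal matching, missing values treated as no match), and a non-empty word list is assumed.

```python
import numpy as np
import pandas as pd

def contains_all(series, words):
    """Return a boolean array: True where a string contains every word."""
    # One boolean mask per word, matched as a literal substring.
    masks = [series.str.contains(w, regex=False, na=False) for w in words]
    # Element-wise AND across all masks.
    return np.logical_and.reduce(masks)

df = pd.DataFrame({"ids": ["ball is field", "ball is wa", "doll is field"]})
result = df[contains_all(df["ids"], ["ball", "field"])]
print(result)
```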
Zero
0

Yet another RegEx approach:

In [409]: df
Out[409]:
               ids
0   ball and field
1  ball, just ball
2      field alone
3   field and ball

In [410]: pat = r'(?:ball.*field|field.*ball)'

In [411]: df[df['ids'].str.contains(pat)]
Out[411]:
              ids
0  ball and field
3  field and ball
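The alternation above needs one branch per ordering, which grows quickly with more words. One possible alternative, sketched here as a swapped-in technique rather than part of the answer, is a pattern with one lookahead per word; `re.escape` is used on the assumption that the words should be matched literally.

```python
import re
import pandas as pd

words = ["ball", "field"]

# One lookahead per word: each word must occur somewhere in the string,
# in any order. Produces '^(?=.*ball)(?=.*field)'.
pat = "^" + "".join(f"(?=.*{re.escape(w)})" for w in words)

df = pd.DataFrame(
    {"ids": ["ball and field", "ball, just ball", "field alone", "field and ball"]}
)
result = df[df["ids"].str.contains(pat)]
print(result)
```

Because `.` does not match a newline by default, strings that span several lines would need `flags=re.DOTALL` passed to `str.contains`.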
MaxU - stand with Ukraine