3

I know it's quite straightforward to use Series.str.contains() (e.g. df['col'].str.contains()) to check whether a column contains a certain substring.

What if I want to do it the other way around: check whether the column's value is contained in a longer string? I searched but couldn't find an answer. I thought this should be easy; in pure Python we could simply write 'a' in 'abc'.

I tried df.isin, but it doesn't seem to be designed for this purpose.

Say I have a df that looks like this:

       col1      col2
0     'apple'    'one'
1     'orange'   'two'
2     'banana'   'three'

I want to select the rows of this df where col1 is contained in the string appleorangefruits; it should return the first two rows.

Ev3rlasting

5 Answers

4

You can call apply on the column, e.g.:

df['your col'].apply(lambda a: a in 'longer string')
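
One thing to watch for with this approach: if the column holds missing values, `a in 'longer string'` raises a TypeError because NaN is a float, not a str. Below is a minimal sketch that guards against that, using the question's sample data plus one NaN row that I added for illustration. The helper name `contained_in` is hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus one NaN row (an assumption, for illustration)
df = pd.DataFrame({
    "col1": ["apple", "orange", "banana", np.nan],
    "col2": ["one", "two", "three", "four"],
})
long_string = "appleorangefruits"

def contained_in(value, haystack=long_string):
    # Hypothetical helper: non-string values (such as NaN) count as "no match"
    # instead of raising TypeError on the `in` test.
    return isinstance(value, str) and value in haystack

mask = df["col1"].apply(contained_in)  # boolean Series aligned with df
result = df[mask]                      # keeps the 'apple' and 'orange' rows
print(result)
```

Whether NaN should count as "no match" or be kept as missing is a design choice; this sketch simply drops those rows.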
Yifei H
4

As apply is notoriously slow I thought I'd have a play with some other ideas.

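If your "long_string" is relatively short and your DataFrame is massive, you could do something weird like this: precompute every substring of the long string and test membership with isin. The number of substrings grows quadratically with the length of the string, which is why this only suits short strings.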

import pandas as pd
from random import choice

# Create a large DataFrame of single-character strings
df = pd.DataFrame(
    data={'test': [choice('abcdef') for _ in range(10_000_000)]}
)

long_string = 'abcdnmlopqrtuvqwertyuiop'

def get_all_substrings(input_string):
    # Every contiguous, non-empty substring of input_string
    length = len(input_string)
    return [input_string[i:j + 1] for i in range(length) for j in range(i, length)]

sub_strings = get_all_substrings(long_string)

# Boolean mask: True where the value is one of the substrings
df.test.isin(sub_strings)

This ran in about 300 ms, versus 2.89 s for the apply(lambda a: a in 'longer string') approach from the other answers. That is roughly ten times quicker!

Note: I took the get_all_substrings function from "How To Get All The Contiguous Substrings Of A String In Python?"
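
To sanity-check the idea on the question's own data, here is a small sketch that builds the substrings as a set (which removes duplicates) and compares the isin mask against the apply mask. The function name `all_substrings` and the variable names are my own choices for illustration.

```python
import pandas as pd

def all_substrings(s):
    # Set of every contiguous, non-empty substring of s (duplicates collapse)
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

# Sample data from the question
df = pd.DataFrame({
    "col1": ["apple", "orange", "banana"],
    "col2": ["one", "two", "three"],
})
long_string = "appleorangefruits"

substrings = all_substrings(long_string)

mask_isin = df["col1"].isin(substrings)                     # vectorised membership test
mask_apply = df["col1"].apply(lambda x: x in long_string)   # element-wise reference

# Both approaches should agree on this data
assert mask_isin.equals(mask_apply)

result = df[mask_isin]  # keeps the 'apple' and 'orange' rows
print(result)
```

One difference to be aware of: in plain Python `'' in 'abc'` is True, but the substring set contains no empty string, so an empty value in the column would match with apply and not with isin.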

Little Bobby Tables
3

You need:

longstring = 'appleorangefruits'
df.loc[df['col1'].apply(lambda x: x in longstring)]

Output:

    col1    col2
0   apple   one
1   orange  two
harvpan
2

If the string you are checking against is a constant, I believe you can achieve it by using DataFrame.apply:

df.apply(lambda row: row['mycol'] in 'mystring', axis=1)
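
Since only one column is involved, a plain list comprehension over that column is another option that avoids building a row object for every row. This is just a sketch on the question's sample data, with names of my own choosing; I have not timed it against the snippet above, so measure on your own data before relying on it.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": ["apple", "orange", "banana"],
    "col2": ["one", "two", "three"],
})
long_string = "appleorangefruits"

# Build the boolean mask in plain Python, then re-attach df's index
# so the mask aligns with the DataFrame rows when used for filtering.
mask = pd.Series(
    [value in long_string for value in df["col1"]],
    index=df.index,
)

result = df[mask]  # keeps the 'apple' and 'orange' rows
print(result)
```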

IWHKYB
1

Try:

>>> df[df.col1.apply(lambda x: x in 'appleorangefruits')]
     col1 col2
0   apple  one
1  orange  two
Karn Kumar