How to check if a string is in a longer string in pandas DataFrame?

Question

I know it's quite straightforward to use df['col'].str.contains() to check whether a column contains a certain substring.

What if I want to go the other way around and check whether the column's value is contained in a longer string? I searched but couldn't find an answer. I thought this would be easy; in pure Python we can simply write 'a' in 'abc'.

I tried df.isin, but it doesn't seem to be designed for this purpose.

Say I have a df that looks like this:

       col1      col2
0     'apple'    'one'
1     'orange'   'two'
2     'banana'   'three'

I want to query this df for rows where col1 is contained in the string appleorangefruits; it should return the first two rows.

Is the longer string you want to check against a constant, or does it vary from case to case? - Kevin Troy, Aug 15 '19 at 15:48
@KevinTroy Thanks, Kevin. It varies. For example, I have a column called ID in the df, but the user gives me IDs in another, slightly longer format. I want to iterate over that ID list to find the matching rows. - Ev3rlasting, Aug 15 '19 at 16:15

score 4 · Answer 1 · answered Aug 15 '19 at 15:49

You can call an apply on the column, i.e.:

df['your col'].apply(lambda a: a in 'longer string')
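
One caveat worth a sketch: `a in 'longer string'` raises a TypeError when `a` is not a string, which happens if the column has missing values. The example below reuses the question's data with one extra hypothetical row containing `None`, and guards the check with `isinstance`. The variable names are illustrative only.

```python
import pandas as pd

# Data modeled on the question, plus one hypothetical row with a missing value.
df = pd.DataFrame({
    "col1": ["apple", "orange", "banana", None],
    "col2": ["one", "two", "three", "four"],
})
longer = "appleorangefruits"

# Only test membership for real strings; anything else counts as "no match".
mask = df["col1"].apply(lambda x: isinstance(x, str) and x in longer)
result = df[mask]
print(result)
```

Without the `isinstance` guard, the `None` row would stop the whole `apply` with an exception instead of simply being excluded.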

Yifei H


score 4 · Accepted Answer · answered Aug 15 '19 at 16:32

As apply is notoriously slow I thought I'd have a play with some other ideas.

If your "long_string" is relatively short and your DataFrame is massive, you could do something weird like this.

import pandas as pd
from random import choice

# Create a large DataFrame of single-character strings
df = pd.DataFrame(
    data={'test': [choice('abcdef') for _ in range(10_000_000)]}
)

long_string = 'abcdnmlopqrtuvqwertyuiop'

def get_all_substrings(input_string):
    # Every contiguous, non-empty substring of input_string
    length = len(input_string)
    return [input_string[i:j + 1] for i in range(length) for j in range(i, length)]

sub_strings = get_all_substrings(long_string)

# True where the column value is one of the substrings of long_string
df.test.isin(sub_strings)

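This ran in about 300ms vs 2.89s for the apply(lambda a: a in 'longer string') approach from the other answers, so roughly ten times quicker. Keep in mind that the number of substrings grows quadratically with the length of the long string, so this only pays off when that string is short.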

Note: I used the get_all_substrings function from How To Get All The Contiguous Substrings Of A String In Python?
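
As a quick sanity check, here is a small sketch that runs both approaches on the question's sample data and compares the masks. It assumes a set-returning variant of `get_all_substrings` (my change, to drop duplicates), and no timings are claimed for it.

```python
import pandas as pd

def get_all_substrings(input_string):
    # Set of every contiguous, non-empty substring (duplicates collapse).
    length = len(input_string)
    return {input_string[i:j + 1] for i in range(length) for j in range(i, length)}

df = pd.DataFrame({
    "col1": ["apple", "orange", "banana"],
    "col2": ["one", "two", "three"],
})
long_string = "appleorangefruits"

# Precompute the substrings once, then use the vectorized isin lookup.
via_isin = df["col1"].isin(get_all_substrings(long_string))

# Reference result from the plain Python membership test.
via_apply = df["col1"].apply(lambda x: x in long_string)

print(df[via_isin])
```

The two masks agree here. One difference to be aware of: an empty string in the column matches with `in` but not with `isin`, because the helper only generates non-empty substrings.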

score 3 · Answer 3 · answered Aug 15 '19 at 16:41

You need:

longstring = 'appleorangefruits'
df.loc[df['col1'].apply(lambda x: x in longstring)]

Output:

    col1    col2
0   apple   one
1   orange  two

harvpan


score 2 · Answer 4 · answered Aug 15 '19 at 15:49

If the string you are checking against is a constant, I believe you can achieve it by using DataFrame.apply:

df.apply(lambda row: row['mycol'] in 'mystring', axis=1)
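
If the longer string is not a constant, as the asker's comment suggests, the same idea can be extended to a list of longer strings. This is only a sketch under assumptions: the `ID` values and the `user_ids` list are made up for illustration, and a row counts as a match if its ID appears inside any of the supplied strings.

```python
import pandas as pd

# Hypothetical short IDs stored in the DataFrame.
df = pd.DataFrame({"ID": ["A12", "B34", "C56"]})

# Hypothetical longer IDs supplied by the user.
user_ids = ["X-A12-2019", "X-C56-2019"]

# A row matches if its ID is a substring of at least one user-supplied ID.
mask = df["ID"].apply(lambda s: any(s in u for u in user_ids))
matched = df[mask]
print(matched)
```

This does one substring test per (row, user ID) pair, so it can get slow when both the DataFrame and the list are large.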

IWHKYB


score 1 · Answer 5 · answered Aug 15 '19 at 17:18

Try:

>>> df[df.col1.apply(lambda x: x in 'appleorangefruits')]
     col1 col2
0   apple  one
1  orange  two

Karn Kumar

