Filter for a string followed by a random row of numbers

Question

I have a row that I would like to filter for in a dataframe.

ch=b611067=football

I would like to filter for just the b611067 section.

I understand I can use str.startswith('b') to find the start of the ID, but what I am looking for is a way to say something like str.contains('random 6-digit numerical value').

Hope this makes sense.

Not familiar with pandas. if you can use regex, try a pattern like `'b[0-9]{6}'` — tobias_k, Apr 29 '19 at 11:23
Is it possible to add some example data so we can reproduce a solution for you? — Erfan, Apr 29 '19 at 11:25
Possible duplicate of [How to filter rows in pandas by regex](https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex) — 3UqU57GnaX, Apr 29 '19 at 11:34
Try pandas.str.extract: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html — hacker315, Apr 29 '19 at 11:58

score 2 · Answer 1 · answered Apr 29 '19 at 11:28

I am not sure (yet) how to do this efficiently in pandas, but you can use regex for the match:

import re

pattern = r'(b\d{6})'  # raw string, so \d is not treated as a string escape
text = 'ch=b611067=football'
matches = re.findall(pattern=pattern, string=text)
for match in matches:
    pass # do something

Edit: this answer explains how to use regex with pandas: How to filter rows in pandas by regex
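
As a rough sketch of how the same pattern could be applied to a DataFrame column instead of a single string: the sample rows and the column name "foo" below are made up for illustration, and str.contains is only one of several ways to do this.

```python
import pandas as pd

# Hypothetical sample data; the column name "foo" is an assumption.
df = pd.DataFrame(
    {"foo": ["ch=b611067=football", "ch=football", "us=b12345=handball"]}
)

# True for rows that contain "b" followed by six digits anywhere in the string.
mask = df["foo"].str.contains(r"b\d{6}", regex=True)

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```

Note that this pattern also accepts IDs with more than six digits, because it only requires six digits after the "b" and does not check what follows them.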

sgvd · Answer 2 · 2019-04-29T11:48:59.780

You can use the .str accessor to use string functions on string columns, including matching by regexp:

import pandas as pd
df = pd.DataFrame(data={"foo": ["us=b611068=handball", "ch=b611067=football", "de=b611069=hockey"]})
print(df.foo.str.match(r'.+=b611067=.+'))

Output:

0    False
1     True
2    False
Name: foo, dtype: bool

You can use this to index the dataframe, so for instance:

print(df[df.foo.str.match(r'.+=b611067=.+')])

Output:

                   foo
1  ch=b611067=football

If you want all rows that match the pattern "b followed by six digits", you can use the expression provided by tobias_k:

df.foo.str.match(r'.+=b[0-9]{6}=.+')
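
If you also want the ID itself rather than just the matching rows, str.extract (suggested in the comments by hacker315) can pull it out with a capture group. A minimal sketch, assuming the ID always sits between two "=" signs; the third row is made up to show a non-match:

```python
import pandas as pd

# Hypothetical sample data; the last row deliberately has no ID.
df = pd.DataFrame(
    {"foo": ["us=b611068=handball", "ch=b611067=football", "de=hockey"]}
)

# Keep only the rows that contain an ID of the form b<six digits>.
matched = df[df["foo"].str.match(r".+=b[0-9]{6}=.+")]

# Extract the ID; rows without a match become NaN.
ids = df["foo"].str.extract(r"=(b[0-9]{6})=", expand=False)

print(matched)
print(ids)
```

With expand=False and a single capture group, extract returns a Series, so it can be assigned straight to a new column.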

Note that df.foo.str.match(r'.+=b611067=.+') gives the same result here as df.foo.str.contains(r'=b611067='), which doesn't require you to provide the wildcards and is the solution given in How to filter rows in pandas by regex. As mentioned in the pandas docs, match lets you be stricter.
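
To illustrate where the two can differ, here is a small sketch with a made-up second row that has nothing before or after the ID. match is anchored at the start of the string and the ".+" parts each require at least one character, while contains searches anywhere in the string:

```python
import pandas as pd

# Hypothetical data: the second value has no prefix or suffix around the ID.
s = pd.Series(["ch=b611067=football", "=b611067="])

# match: anchored at the start, and ".+" needs at least one character on each side.
strict = s.str.match(r".+=b611067=.+")

# contains: succeeds if the pattern occurs anywhere in the string.
loose = s.str.contains(r"=b611067=")

print(strict.tolist())
print(loose.tolist())
```

The second row passes contains but fails match, which is the kind of extra strictness the note refers to.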
