parsing strings for specific elements - python

Question

I have a pandas data-frame that contains a column of sentences with pattern: row 1 of column : "ID is 123 or ID is 234 or ID is 345" row 2 of column : "ID is 123 or ID is 567 or ID is 876" row 3 of column : "ID is 567 or ID is 567 or ID is 298".

My aim is to extract the numbers in each row and save them in a list or numpy array. Since there is a pattern (the number always comes after "ID is", I thought that regex might be the best way to go for it (but I am not sure how to use regex for multiple extractions in 1 string.

Any advice?

Can there be more or less IDs per row, or is it always exactly 3? — Bill the Lizard, Feb 15 '22 at 22:32
Is the rest of the text really relevant, or do you just want to extract numbers? In other words, might there be numbers you *don't* want? If not a simple regex will pull out the numbers. — Mark, Feb 15 '22 at 22:34
Yes, each row could have a different number of IDs, but the pattern is the same. The number always comes after the sentence "ID is" — KarimZ, Feb 15 '22 at 22:36
The rest of the text is not relevant; all I need is the IDs that I will use as a merge to another table. There are no other numbers. — KarimZ, Feb 15 '22 at 22:37
The output list/ array that I want to have from the example given is[ [123,234,345],[123,567,876],[567,567,298]] — KarimZ, Feb 15 '22 at 22:39
I would go with `\d+` seems to match digits just fine. Where `\d` is a regex class representing a number, and `+` is a 1 or more quantifier. — sln, Feb 15 '22 at 22:39
you could create minimal working code with example data in columns - so we could simply copy and run it - and then we could create solution. — furas, Feb 16 '22 at 00:37

furas · Accepted Answer · 2022-02-16T01:48:38.890

Standard module re can use '\d+'

re.findall('\d+', "ID is 123 or ID is 234 or ID is 345")

to get list [123,234,345].

To make sure you can also use 'ID is (\d+)'

re.findall('ID is (\d+)', "ID is 123 or ID is 234 or ID is 345")

In DataFrame you can use .str.findall() to do the same for all rows.

import pandas as pd


df = pd.DataFrame({
  'ID': [
    "ID is 123 or ID is 234 or ID is 345",
    "ID is 123 or ID is 567 or ID is 876",
    "ID is 567 or ID is 567 or ID is 298",
  ]
})

print('\n--- before ---\n')
print(df)
 
df['result'] = df['ID'].str.findall('ID is (\d+)')

print('\n--- after ---\n')
print(df)

Result:

--- before ---

                                    ID
0  ID is 123 or ID is 234 or ID is 345
1  ID is 123 or ID is 567 or ID is 876
2  ID is 567 or ID is 567 or ID is 298

--- after ---

                                    ID           result
0  ID is 123 or ID is 234 or ID is 345  [123, 234, 345]
1  ID is 123 or ID is 567 or ID is 876  [123, 567, 876]
2  ID is 567 or ID is 567 or ID is 298  [567, 567, 298]

If you need only column result as numpy array then you can get df['result'].values.

And if you need as nested list: df['result'].values.tolist().

parsing strings for specific elements - python

1 Answers1