-1

I have a pandas data-frame that contains a column of sentences with pattern: row 1 of column : "ID is 123 or ID is 234 or ID is 345" row 2 of column : "ID is 123 or ID is 567 or ID is 876" row 3 of column : "ID is 567 or ID is 567 or ID is 298".

My aim is to extract the numbers in each row and save them in a list or numpy array. Since there is a pattern (the number always comes after "ID is", I thought that regex might be the best way to go for it (but I am not sure how to use regex for multiple extractions in 1 string.

Any advice?

KarimZ
  • 9
  • 4
  • Can there be more or less IDs per row, or is it always exactly 3? – Bill the Lizard Feb 15 '22 at 22:32
  • Is the rest of the text really relevant, or do you just want to extract numbers? In other words, might there be numbers you *don't* want? If not a simple regex will pull out the numbers. – Mark Feb 15 '22 at 22:34
  • Yes, each row could have a different number of IDs, but the pattern is the same. The number always comes after the sentence "ID is" – KarimZ Feb 15 '22 at 22:36
  • The rest of the text is not relevant; all I need is the IDs that I will use as a merge to another table. There are no other numbers. – KarimZ Feb 15 '22 at 22:37
  • The output list/ array that I want to have from the example given is[ [123,234,345],[123,567,876],[567,567,298]] – KarimZ Feb 15 '22 at 22:39
  • 1
    I would go with `\d+` seems to match digits just fine. Where `\d` is a regex class representing a number, and `+` is a 1 or more quantifier. – sln Feb 15 '22 at 22:39
  • you could create minimal working code with example data in columns - so we could simply copy and run it - and then we could create solution. – furas Feb 16 '22 at 00:37

1 Answers1

0

Standard module re can use '\d+'

re.findall('\d+', "ID is 123 or ID is 234 or ID is 345")

to get list [123,234,345].

To make sure you can also use 'ID is (\d+)'

re.findall('ID is (\d+)', "ID is 123 or ID is 234 or ID is 345")

In DataFrame you can use .str.findall() to do the same for all rows.

import pandas as pd


df = pd.DataFrame({
  'ID': [
    "ID is 123 or ID is 234 or ID is 345",
    "ID is 123 or ID is 567 or ID is 876",
    "ID is 567 or ID is 567 or ID is 298",
  ]
})

print('\n--- before ---\n')
print(df)
 
df['result'] = df['ID'].str.findall('ID is (\d+)')

print('\n--- after ---\n')
print(df)

Result:

--- before ---

                                    ID
0  ID is 123 or ID is 234 or ID is 345
1  ID is 123 or ID is 567 or ID is 876
2  ID is 567 or ID is 567 or ID is 298

--- after ---

                                    ID           result
0  ID is 123 or ID is 234 or ID is 345  [123, 234, 345]
1  ID is 123 or ID is 567 or ID is 876  [123, 567, 876]
2  ID is 567 or ID is 567 or ID is 298  [567, 567, 298]

If you need only column result as numpy array then you can get df['result'].values.

And if you need as nested list: df['result'].values.tolist().

furas
  • 134,197
  • 12
  • 106
  • 148