
I'm a newbie and I'm sure this is something silly in my code. In my defense, I re-read the Python re documentation and searched around before asking, but I haven't found a duplicate question so far (which surprised me).

Outside of a DataFrame, my regex works, as in this example:

import re

x = "my best friend's birthday is 24 Jan 2001."
print(re.findall(r'\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d{2,4}', x))
<Anaconda console returns:> ['24 Jan 2001']

But in my DataFrame (df1) I have the following:

index     text
0         My birthday is 2/21/19
1         Your birthday is 4/1/20
2         my best friend's birthday is 24 Jan 2001.   

When I run the following code:

df1['dates'] = df1['text'].str.extract(r'.*?(\d+[/-]\d+[/-]?\d*).*?|\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d+')
print(df1['dates'])

I get the following results:

     dates
0    2/21/19
1    4/1/20
2    NaN

I've tried playing around with the parentheses, rereading the documentation, and some other tweaks that just resulted in endless errors. I'm sure it's an obvious mistake, but I don't see it. Can someone help? Thank you.

Programming_Learner_DK

1 Answer


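You have to have a capture group when using .extract() in pandas, because it returns what the group captured rather than the whole match. Your capture group before the OR, |, is finding the dates with slashes. But after the OR, you only have a non-capturing group, so when that side matches, the capture group is empty and you get NaN.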
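This can be checked with plain re, outside pandas. The following is only a minimal sketch using the pattern from the question; the variable names are made up for illustration.

```python
import re

# The pattern from the question: only the first alternative has a capture group.
pattern = (r'.*?(\d+[/-]\d+[/-]?\d*).*?'
           r'|\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d+')

slash_match = re.search(pattern, 'My birthday is 2/21/19')
month_match = re.search(pattern, "my best friend's birthday is 24 Jan 2001.")

print(slash_match.group(1))  # 2/21/19
print(month_match.group(0))  # 24 Jan 2001 (the whole match)
print(month_match.group(1))  # None, which pandas reports as NaN
```

The third line shows the problem: the regex does match the "24 Jan 2001" row, but nothing was captured, so there is nothing for .extract() to return.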

Here I have placed a capture group around the entire search pattern, and each side of the OR is also wrapped in its own non-capturing group.

import pandas as pd

df = pd.DataFrame({'text': ['My birthday is 2/21/19', 
    'Your birthday is 4/1/20', 
    'my best friend\'s birthday is 24 Jan 2001.']})

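# Non-capturing groups are written (?:...), not (:?...).
# With a single capture group, expand=False returns a Series.
df.text.str.extract(
    r'((?:\d+[/-]\d+[/-]?\d*)|' + 
    r'(?:\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d+))', 
    expand=False)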

# returns:
0        2/21/19
1         4/1/20
2    24 Jan 2001
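
As a side note, the number of capture groups decides the shape of what .extract() gives back: each capture group becomes one column, while (?:...) groups add none. The sketch below illustrates that under the assumption of a reasonably recent pandas; names such as texts, months, one_group and two_groups are invented for the example.

```python
import pandas as pd

texts = pd.Series(['My birthday is 2/21/19',
                   "my best friend's birthday is 24 Jan 2001."])

months = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'

# One capture group around both alternatives; (?:...) groups do not capture.
one_group = r'(\d+[/-]\d+[/-]?\d*|\d{1,2}\s(?:' + months + r')[a-z]*\s\d+)'
dates = texts.str.extract(one_group, expand=False)  # a Series
print(dates.tolist())

# One capture group per alternative: one column each,
# with NaN in the column whose alternative did not match.
two_groups = r'(\d+[/-]\d+[/-]?\d*)|(\d{1,2}\s(?:' + months + r')[a-z]*\s\d+)'
columns = texts.str.extract(two_groups)  # a DataFrame with columns 0 and 1
combined = columns[0].fillna(columns[1])  # merge the two columns into one
print(combined.tolist())
```

Either route should give the same dates; the single-group version just avoids the extra merge step.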
James
  • James, I added one closing parenthesis to your code in the first re statement in the extract to get this to work as expected. Your answer helped me tremendously, thank you: r'((:?(\d+[/-]\d+[/-]?\d*))|' + – Programming_Learner_DK Mar 14 '18 at 10:04