I'm trying to match dates in a dataframe with 500 entries using regex.

The dates can appear in the following formats:

04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010

dates[dates[0].str.contains(r'(?P<year>\d?\d?\d\d)')].shape

returns (500, 1), i.e. every row matches,

but

dates[dates[0].str.contains(r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)??P<year>(\d?\d?\d\d))')].shape

returns (0, 1). The day group is optional, so shouldn't the pattern still match on the year group alone?

Derek O

1 Answer

Ok I got it.

The correct regex pattern is: r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)?(?P<year>\d?\d?\d\d))'

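The opening bracket of the year group was in the wrong position: I had written ??P<year>( where it should have been ?(?P<year>. Without the leading "(?", the text P<year> is not a group name at all but a literal string the regex tries to find, and no date contains it.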
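
To see the difference without the dataframe, here is a minimal sketch using the standard re module (str.contains does a regex search on each entry, so re.search should behave the same way for this purpose). The samples list is made up from the formats in the question, and the names broken and fixed are just labels for the two patterns.

```python
import re

# Hypothetical sample strings, taken from the formats listed in the question.
samples = ["04/20/2009", "Mar 20th, 2009", "20 March 2009", "6/2008", "2009"]

# Pattern from the question: "??" makes the day group lazily optional, and
# the following "P<year>" is literal text that must appear in the string.
broken = re.compile(r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)??P<year>(\d?\d?\d\d))')

# Pattern from the answer: "(?P<year>" opens a real named group.
fixed = re.compile(r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)?(?P<year>\d?\d?\d\d))')

broken_hits = [s for s in samples if broken.search(s)]
fixed_hits = [s for s in samples if fixed.search(s)]

print(broken_hits)  # empty: no sample contains the literal text "P<year>"
print(fixed_hits)   # every sample contains at least two consecutive digits

# The broken pattern only matches when the literal text is present.
print(broken.search("P<year>2009") is not None)
```

One caveat with the fixed pattern: it is loose enough for a containment check, but the captured groups may not be what you expect. On a bare 2009, the optional day group greedily takes 20 and leaves 09 for year.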