
I want to extract dates in the format Month Date Year.

For example: 14 January, 2005 or Feb 29 1982

The code I'm using: date = re.findall(r'\d{1,3} Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December \d{1,3}[, ]\d{4}',line)

Python interprets this as 1-3 digits followed by "Jan", or any one of the other month names on its own. So it matches just "Feb" or "12 Jan", but not the rest of the date.

So how do I group ONLY the months, so that the | applies to the month names but not to the rest of the expression?

  • You say that you want to extract dates in the format Month Date Year, but give two different formats. Do you instead mean "extract dates in the following formats and then convert to Month Date Year"? – soyapencil Jul 20 '20 at 17:22
  • Yes. I just want to extract the date itself in order to later convert it to Month Date Year: So 14, Jan 2013 -> 14 Jan 2013 – HeyitsPohkee Jul 20 '20 at 20:32

1 Answer


Answering your question directly, you can make two regexps for your "Day Month Year" and "Month Day Year" formats, then check them separately.

import datetime

# Make months using list comp
months_shrt = [datetime.date(1,m,1).strftime('%b') for m in range(1,13)]
months_long = [datetime.date(1,m,1).strftime('%B') for m in range(1,13)]

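# Join together; (?:...) is a non-capturing group, so re.findall returns the whole match
months = months_shrt + months_long
months_or = f'(?:{"|".join(months)})'

# Raw strings keep \d from being treated as a string escape
expr_dmy = r'\d{1,2},? ' + months_or + r',? \d{4}'
expr_mdy = months_or + r',? \d{1,2},? \d{4}'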

You can try both out and see which one matches. However, you'll still need to inspect it and convert it to your favourite flavour of date format.
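As a rough sketch of how the two expressions could be used, the snippet below runs both over a line and collects the matches. The month names are hard-coded (an assumption, so the result does not depend on the locale), and the names MONTHS, DMY, MDY and find_dates are made up for illustration. The (?:...) group is what limits the | to the month names.

```python
import re

# Month names hard-coded so the sketch does not depend on the locale.
# Long names come first so "January" is tried before "Jan".
MONTHS = ('January|February|March|April|May|June|July|August|September|'
          'October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec')

# (?:...) groups the alternation without capturing it, so | only
# separates month names and findall returns the whole match.
DMY = re.compile(r'\b\d{1,2},? (?:' + MONTHS + r'),? \d{4}\b')
MDY = re.compile(r'\b(?:' + MONTHS + r'),? \d{1,2},? \d{4}\b')


def find_dates(line):
    """Return every date-like substring of line, day-first matches first."""
    return DMY.findall(line) + MDY.findall(line)


print(find_dates('Signed 14 January, 2005 and renewed Feb 29 1982.'))
# ['14 January, 2005', 'Feb 29 1982']
```

Note that this only finds text that looks like a date. It does not check that the date exists (Feb 29 1982 does not, since 1982 was not a leap year).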

Instead, I would advise not using regexp at all, and simply trying different date formats.

# Optional separator after the first and second fields: nothing or a comma
str_a = ('', ',')
str_b = ('', ',')

base_fmts = [('%d', '%b', '%Y'),
             ('%d', '%B', '%Y'),
             ('%b', '%d', '%Y'),
             ('%B', '%d', '%Y')]

def my_formatter(s):
    for o in base_fmts:
        for i in range(2):
            for j in range(2):
                # Concatenate
                fmt = f'{o[0]}{str_a[i]} '
                fmt += f'{o[1]}{str_b[j]} '
                fmt += f'{o[2]}'
    
                try:
                    d = datetime.datetime.strptime(s, fmt)
                except ValueError:
                    continue
                else:
                    return d

The function above takes a string and returns a datetime.datetime object, or None if none of the formats match. You can use the standard datetime.datetime attributes to get your day, month and year back.

>>> d = my_formatter('Jan 15, 2009')
>>> (d.month, d.day, d.year)
(1, 15, 2009)
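To get from there to a single "Month Day Year" string, one option is to format the parsed date with strftime. The sketch below is a compact variant of the same idea, not the function above: normalize_date is a hypothetical helper that strips commas first, so only four formats are needed. It assumes English month names (the default C locale).

```python
import datetime

# Day-first and month-first layouts, with short and long month names.
FORMATS = ('%d %b %Y', '%d %B %Y', '%b %d %Y', '%B %d %Y')


def normalize_date(text):
    """Hypothetical helper: return the date as 'Mon DD YYYY', or None."""
    # Drop commas and collapse runs of whitespace to single spaces.
    cleaned = ' '.join(text.replace(',', ' ').split())
    for fmt in FORMATS:
        try:
            parsed = datetime.datetime.strptime(cleaned, fmt)
        except ValueError:
            # Wrong layout, or a date that does not exist.
            continue
        return parsed.strftime('%b %d %Y')
    return None


print(normalize_date('14, Jan 2013'))      # Jan 14 2013
print(normalize_date('14 January, 2005'))  # Jan 14 2005
print(normalize_date('Feb 29 1982'))       # None (1982 was not a leap year)
```

Because strptime validates the calendar date, impossible dates such as Feb 29 1982 are rejected here, which a regexp alone would not catch.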
soyapencil