3

I have some text, taken from different websites, that I want to extract dates from. As one can imagine, the dates vary substantially in how they are formatted, and look something like:

Posted: 10/01/2014 
Published on August 1st 2014
Last modified on 5th of July 2014
Posted by Dave on 10-01-14

Does anyone know of a Python library (or API) that would help with this, other than regex, which will be my fallback? I could probably remove the "posted on" parts fairly easily, but making the rest consistent does not look easy.

kyrenia
  • 3
    `dateutil` has a [parser](https://dateutil.readthedocs.org/en/latest/parser.html) that is pretty forgiving with regards to what date formats you throw at it. That would be my first choice for something like this. – Lukas Graf Apr 16 '15 at 18:52
  • 1
    @LukasGraf - thanks, this worked well - I have posted my code as an answer using this (ran into a couple of problems with what it defaulted to when data was missing). – kyrenia Apr 16 '15 at 20:01

2 Answers

2

My solution using dateutil

Following Lukas's suggestion, I used the dateutil package (it seemed far more flexible than Arrow) with the fuzzy option, which ignores tokens that are not part of a date.

Caution on Fuzzy parsing using dateutil

The main thing to note, as discussed in the thread Trouble in parsing date using dateutil, is that when the parser cannot find a day, month, or year it fills that field from a default value (the current date, unless you specify one). As far as I can tell, it reports no flag to indicate that it used the default.

As a result, "random text" would have returned today's date, 2015-4-16, which could have caused problems.

Solution

Since I really want to know when it fails, rather than have the date filled in with a default value, I ended up parsing twice with two different defaults and checking whether each field took the default in both runs. If it did not, I assumed the field was parsed correctly.

from datetime import datetime
from dateutil.parser import parse

def extract_date(text):
    # Parse twice with different defaults; a field that equals the default
    # in both runs was not found in the text.
    date = {}
    date_1 = parse(text, fuzzy=True, default=datetime(2001, 1, 1))
    date_2 = parse(text, fuzzy=True, default=datetime(2002, 2, 2))

    if date_1.day == 1 and date_2.day == 2:
        date["day"] = "XX"
    else:
        date["day"] = date_1.day

    if date_1.month == 1 and date_2.month == 2:
        date["month"] = "XX"
    else:
        date["month"] = date_1.month

    if date_1.year == 2001 and date_2.year == 2002:
        date["year"] = "XXXX"
    else:
        date["year"] = date_1.year

    return date

print(extract_date("Posted: by dave August 1st"))

Obviously this is a bit of a botch (so if anyone has a more elegant solution, please share), but it correctly parsed the four examples I had above (it assumed US format rather than UK format for the date 10/01/2014), and it returned XX or XXXX as appropriate when a field was missing from the input.
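On the US versus UK ambiguity mentioned above: dateutil's parse accepts a dayfirst argument that changes how an ambiguous numeric date such as 10/01/2014 is read. The following is only a minimal sketch, assuming dateutil is installed; the helper name parse_numeric_date is made up for illustration and is not part of the original answer.

```python
from datetime import datetime
from dateutil.parser import parse

def parse_numeric_date(text, day_first=False):
    # Hypothetical helper: fuzzy-parse `text`, reading an ambiguous
    # numeric date as day/month when day_first is True.
    return parse(text, fuzzy=True, dayfirst=day_first,
                 default=datetime(2001, 1, 1))

us_style = parse_numeric_date("10/01/2014")                  # month first
uk_style = parse_numeric_date("10/01/2014", day_first=True)  # day first
print(us_style.date(), uk_style.date())
```

If the source website is known to use UK-style dates, passing day_first=True for that site avoids silently swapping day and month.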

kyrenia
  • Although it works, `fuzzy="TRUE"` is a bit misleading, why not just `fuzzy=True` ? – Liviu Chircu Sep 10 '15 at 08:41
  • 1
    @LiviuChircu - no reason... except that I wrote this when I was just learning how to code - updated as you suggested to be cleaner – kyrenia Sep 10 '15 at 18:28
0

You could use the Arrow library:

arrow.get('10-01-2014', ['MM/DD/YYYY', 'MM-DD-YYYY'])

It takes two arguments: first the string to parse, and second a list of formats to try.
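The same try-each-format idea can be sketched with only the standard library, using datetime.strptime instead of Arrow. This is a rough illustration, not Arrow's implementation; the FORMATS list and the function name parse_with_formats are assumptions made up for the example, and the input must already be trimmed down to just the date.

```python
from datetime import datetime

# Hypothetical candidate formats, written as strptime patterns.
FORMATS = ["%m/%d/%Y", "%m-%d-%Y", "%m-%d-%y"]

def parse_with_formats(text, formats=FORMATS):
    """Return the first successful parse of `text`, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    return None

print(parse_with_formats("10/01/2014"))
print(parse_with_formats("10-01-14"))
print(parse_with_formats("August 1st 2014"))
```

Returning None on failure makes it explicit when nothing matched, which addresses the "did it fall back to a default?" concern from the other answer, at the cost of having to list every format up front.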

dizballanze