As Scotty1 correctly pointed out, pandas.to_datetime does in fact work for the use case I described. However, it does not generalize to the case where YMD is preferred over DMY (which happens to be the preference in Sweden).
I ended up with something that works in well over 95% of my cases, which is much better than what any of the existing date parsing libraries can manage out of the box. Here is my solution:
import dateparser

def parse(string):
    # Format templates; {sep} is filled in with each separator below.
    dmy = ['%d{sep}%m{sep}%Y', '%d{sep}%m{sep}%y']
    ymd = ['%Y{sep}%m{sep}%d', '%y{sep}%m{sep}%d']
    separators = ['', ' ', '-', '.', '/']
    # DMY templates are listed first, so they are tried before YMD.
    formats = [f.format(sep=sep) for f in dmy + ymd for sep in separators]
    additional = ['%d/%m %Y']
    return dateparser.parse(string, date_formats=formats + additional)
Support for "YMD preferred over DMY" can be achieved by replacing dmy + ymd with ymd + dmy.
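The swap described above could also be exposed as a flag instead of editing the function by hand. This is only a sketch: the helper name `build_formats` and the `prefer_ymd` parameter are my own invention for illustration, not part of dateparser. It builds the same list of formats as the function above and assumes the result is then passed as `date_formats` to `dateparser.parse`.

```python
def build_formats(prefer_ymd=False):
    """Return candidate strptime formats, preferred convention first.

    Hypothetical helper: the name and the prefer_ymd flag are assumptions
    made for this example.
    """
    dmy = ['%d{sep}%m{sep}%Y', '%d{sep}%m{sep}%y']
    ymd = ['%Y{sep}%m{sep}%d', '%y{sep}%m{sep}%d']
    separators = ['', ' ', '-', '.', '/']
    # The preferred convention goes first so that it is tried first.
    ordered = ymd + dmy if prefer_ymd else dmy + ymd
    formats = [f.format(sep=sep) for f in ordered for sep in separators]
    # Special case from the original function: day/month followed by a year.
    return formats + ['%d/%m %Y']


# Intended use (requires the third-party dateparser package):
#   dateparser.parse(string, date_formats=build_formats(prefer_ymd=True))
```

With `prefer_ymd=True`, a string such as '010203' should be tried against the YMD formats before the DMY ones.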
To help communicate the behaviour of the code above, here is a set of tests that all pass:
import datetime

out = datetime.datetime(2003, 2, 1, 0, 0)
# straightforward DMY
assert out == extractors.extract_date('010203')
assert out == extractors.extract_date('01022003')
assert out == extractors.extract_date('01-02-03')
assert out == extractors.extract_date('01-02-2003')
# alternative delimiters
assert out == extractors.extract_date('01.02.03')
assert out == extractors.extract_date('01 02 03')
assert out == extractors.extract_date('01/02/03')
assert out == extractors.extract_date('01/02 2003')
# YMD (when the first cannot parse as a day, default to YMD)
assert out == extractors.extract_date('2003-02-01')
assert extractors.extract_date('98-02-01') == \
    datetime.datetime(1998, 2, 1, 0, 0)
# single digits
assert out == extractors.extract_date('1-2-2003')
assert out == extractors.extract_date('1/2 2003')
assert out == extractors.extract_date('2003-2-1')
# when there are no other possibilities (MDY, YDM)
assert extractors.extract_date('12-31-98') == \
    datetime.datetime(1998, 12, 31, 0, 0)
assert extractors.extract_date('98-31-12') == \
    datetime.datetime(1998, 12, 31, 0, 0)
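To illustrate why the YMD tests pass even though DMY is listed first, here is a rough standard-library sketch of the "first matching format wins" idea. It uses `datetime.strptime` instead of dateparser, so it is a stand-in and not what dateparser does internally. The function name `parse_first_match` and the shortened format list are assumptions made for the example.

```python
import datetime

# Shortened, dash-only version of the format list, DMY before YMD.
FORMATS = ['%d-%m-%Y', '%d-%m-%y', '%Y-%m-%d', '%y-%m-%d']


def parse_first_match(string, formats=FORMATS):
    """Return the result of the first format that parses, else None."""
    for fmt in formats:
        try:
            return datetime.datetime.strptime(string, fmt)
        except ValueError:
            # This format does not fit the whole string; try the next one.
            continue
    return None


print(parse_first_match('01-02-03'))    # DMY with a two-digit year
print(parse_first_match('2003-02-01'))  # no DMY format fits, so YMD is used
print(parse_first_match('12-31-98'))    # none of the four formats fit
```

In this sketch '12-31-98' comes back as None, because 31 is not a valid month in any of the listed formats. In the tests above that input still parses as MDY, presumably because dateparser falls back to its own parsing when none of the supplied formats match.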