5

I am looking for a way to parse dates of unknown formats using the following "meta-formats" in order of preference:

  1. day-month-year (DMY)
  2. year-month-day (YMD)
  3. potentially other formats (but thats not important)

This is the actual meta-formats present on almost all invoices from Norway, Denmark, Finland and The Netherlands, so it should be a common use case. However, it seems like non of the libraries out there is able to handle it without having to define a huge list of possible formats.

To be specific. I need a method (parse) to satisfy the following: parse("01-02-03") == "datetime.datetime(2003, 2, 1, 0, 0)" parse("2003-02-01") == "datetime.datetime(2003, 2, 1, 0, 0)"

But it should also work for other delimiters etc.

Any suggestions on how this can be accomplished without defining a huge list of formats?

Edit: Since Sweden has a different preference I prefer an answer that can be generalized to work for the case where YMD is preferred over DMY.

elgehelge
  • 2,014
  • 1
  • 19
  • 24

5 Answers5

4

Have you tried using pandas? Imho this is the best and cleanest way to import dates, as it works out-of-the-box in 99% of the times while most other things like dateutil tend to fail.

import pandas as pd
pd.to_datetime('01-02-03', dayfirst=True)
pd.to_datetime('2003-02-01', dayfirst=True)

Another advantage of pandas is that it will work with arrays, lists and most other types and even supports string-indexing for arrays (called DataFrames) with datetime-index.

Some more information on how to get the datetime.datetime format with pandas:
Just append .to_pydatetime() to your parser.

pd.to_datetime('2003-02-01', dayfirst=True).to_pydatetime()
# Out[]: datetime.datetime(2003, 2, 1, 0, 0)
JE_Muc
  • 5,403
  • 2
  • 26
  • 41
  • This is solving the problem in a general way. Thanks Scotty. But let me tell you a really sad story. Of course Sweden chose to do things a little different, ie. primarily YMD, but sometimes DMY. Neither pandas or dateutil can be modified using `dayfirst` and `yearfirst` to fit this need. I guess what's really missing is a `monthinmiddle` flag - this would even work for all countries simultaneously. – elgehelge May 09 '18 at 10:25
  • Using `dayfirst=True` should be working on YMD and DMY, as long as the format in YMD is YYYYMMDD. In these cases `dayfirst=True` quite much works like `monthinmiddle`. If you have something like YYMMDD there is **absolutely no way**, any tool can get the correct date without further information (except for `YY>30` or `YY>31`), since the given information is ambiguous. – JE_Muc May 09 '18 at 10:39
  • True, you can never know for sure. But in my use case I always prefer one over the other. And that is not possible using the `dayfirst` flag. – elgehelge May 09 '18 at 13:59
  • Why not? You can set it depending on the input. That is exactly why this flag is existing. You could for example check if the first number has four digits or two digits and then set `dayfirst` accordingly. Or you could implement it so that it uses your preferred format for the first guess and if this throws an exception use another format. But python has no chance of knowing which is your preferred format or the correct date of `01-02-03` without further information. This could be any date. Python is no mind reading wizardry. ;) – JE_Muc May 09 '18 at 14:04
  • I will try to explain. Lets say we to prefer YMD over DMY, and is still interested in parsing a wide variety of unknown formats. Two examples of dates we want to be able to parse is now: '03-02-01' and '01-02-2003' (both as Feb. 1st 2003). Using `pd.to_datetime(DATESTRING, yearfirst=True).to_pydatetime()` fails on the second example, and adding `dayfirst` so that it becomes `pd.to_datetime(DATESTRING, yearfirst=True, dayfirst=True).to_pydatetime()` fails on the first example. There is just no way to do it. The `dayfirst` and `yearfirst` interface is just not powerful enough. – elgehelge May 10 '18 at 17:26
  • I understand that to full extent. But what I wanted to say is: It still seems ALOT easier to me to do something like `pd.to_datetime('03-02-01', yearfirst=True, dayfirst=DMY).to_pydatetime()` with `DMY=False` or `DMY=True`, set depending on which date ist parsed. You just have to set the `dayfirst` and `yearfirst` bool variables correctly, all other stuff that you need is already implemented in pandas. That should be no problem, since you do that checking in your own solution as well (just alot more complicated). No need to do all that work again and write already existing stuff on your own. – JE_Muc May 11 '18 at 12:40
  • The point is I don't know what date I will get, but I do know my preference (order of most likely formats). So I don't know whether I will get '03-02-01' or '01-02-2003' (or something else), but I want to be able to parse both cases as Feb. 1st 2003 without knowing what format the date is in. – elgehelge May 14 '18 at 19:55
  • Yes, I understood that. But why don't you just check for the formats that can't be parsed automatically by hand (there is absolutely no other way than doing that by hand) in the order of your preference and set the bool checkers like `dayfirst` according to your checks? Like in `date_str='01-02-2003'` `if len(date_str.split('-')[-1]) == 4:` `DMY=True` `YF=False` `pd.to_datetime(date_str, yearfirst=YF, dayfirst=DMY).to_pydatetime()` – JE_Muc May 15 '18 at 09:19
  • very late comment, but `pandas` is HUGE, itself being 11.6MB, then even depending on `numpy` (12.1MB), looks like a good definition of excessive overkill. – mariotomo May 25 '18 at 16:25
1

Check out arrow library in python. You can specify the format of the date in whatever format you like. For example:

arrow.get("01-02-03","DD-MM-YY")
# gives <Arrow [2003-02-01T00:00:00+00:00]>
arrow.get("01-02-03","YY-MM-DD")
# gives <Arrow [2001-02-03T00:00:00+00:00]>
Akshay Apte
  • 1,539
  • 9
  • 24
  • 1
    The author of this questions seems to be looking for a generic "magic" function that will guess the format, without the need to specify it. – samu May 09 '18 at 09:19
  • Oh I see. Maybe use a list of known formats and parse the date for each format? – Akshay Apte May 09 '18 at 09:23
  • 1
    Yes, or you could chain them like this `arrow.get('01-02-03', ['DD-MM-YY', 'YY-MM-DD'])` but it quickly comes out of hands when trying to expand to the general case. Just for DMY a bare minimum would look like this `'DDMMYY', 'DD-MM-YY', 'DD/MM/YY', 'DD.MM.YY', 'DDMMYYYY', 'DD-MM-YYYY', 'DD/MM/YYYY', 'DD.MM.YYYY'` (which could be build programmatically of course). This is a good solution if there is no general purpose date parsers out there. Thanks! – elgehelge May 09 '18 at 09:45
  • It turns out `arrow` is not the right library for trying multiple formats sequentially, but other libraries like `dateparser` works well for this. Here is an example that shows why `arrow` fails: `arrow.get('2003-02-01', ['DD-MM-YY', 'YYYY-MM-DD'])` returns `` – elgehelge May 10 '18 at 17:50
0

Have a look at dateutil.parser.parse?

from dateutil.parser import parse

parse('01-02-03', dayfirst=True)  # datetime.datetime(2003, 2, 1, 0, 0)
parse('2003-02-01')  # datetime.datetime(2003, 2, 1, 0, 0)

Of course you need to fine-tune the arguments to parse(), as it will not always guess if it's an YDM or YMD format, but it's a good start. Have a look at the documentation for more examples.

samu
  • 2,870
  • 16
  • 28
  • 1
    This does not work in all cases without specifying the arguments for each case. `parse('2003-02-01', dayfirst=True)` yields `datetime.datetime(2003, 1, 2, 0, 0)`, which is not correct. Omitting `dayfirst=True` yields a false date in the first case. – JE_Muc May 09 '18 at 09:13
  • True, but then again - it's not really easy to determine what the user meant when he typed the date. For example, consider `01/02/03` - how do you know which one is the day, month and year? In the US it would probably be year 2003, month 1, day 2, in Europe it'll be year 2003, month 2, day 1. It's not really possible to guess without additional hints. – samu May 09 '18 at 09:16
  • 1
    He stated that he is mostly using formats used in Norway, Denmark, Finland and The Netherlands, which is always `dayfirst=True` or ISO-date. If there should be a special case of using a US-date, there is no way to parse it without specifying additional information, but this should be considered as a special case. – JE_Muc May 09 '18 at 09:19
  • And the difficulty is that even if you use the most naive case - a list of POSSIBLE date format strings, which you try one after another - it's still possible that you'll get a valid result. For example it's not possible to guess if which is the month from `01/02/2003`, but it is possible to do for `01/13/2003`. If you're parsing users input, you can help yourself by looking at the locale. If it's a website, you can look at the country the request originated from (or at `Accept-Language` headers). In case of invoices, you could use the invoice metadata - the country of the invoice issuer. – samu May 09 '18 at 09:19
  • If you can come up with something that uses locals that is a perfectly valid solution. As I mentioned I do know that the dates are Northern European. – elgehelge May 09 '18 at 09:40
0

As Scotty1 correctly pointed out, pandas.to_datetime does in fact work for the use-case I described, however it does not generalize to the use-case where YMD is preferred over DMY (which happens to be the preference in Sweden).

I ended up with something that works in well over 95% of my cases which is much better than what any of the existing date parsing libraries can match out of the box. Here is my solution:

def parse(string):
    dmy = ['%d{sep}%m{sep}%Y', '%d{sep}%m{sep}%y']
    ymd = ['%Y{sep}%m{sep}%d', '%y{sep}%m{sep}%d']
    seperators = ['', ' ', '-', '.', '/']
    formats = [f.format(sep=sep) for f in dmy + ymd for sep in seperators]
    additional = ['%d/%m %Y']
    return dateparser.parse(string, date_formats=formats + additional)

Support for "YMD preferred over DMY" can be achieved by replacing dmy + ymd with ymd + dmy.

To help communicate the behaviour of the code above, here is a set of tests that all passes:

out = datetime.datetime(2003, 2, 1, 0, 0)

# straight forward DMY
assert out == extractors.extract_date('010203')
assert out == extractors.extract_date('01022003')
assert out == extractors.extract_date('01-02-03')
assert out == extractors.extract_date('01-02-2003')

# alternative delimiters
assert out == extractors.extract_date('01.02.03')
assert out == extractors.extract_date('01 02 03')
assert out == extractors.extract_date('01/02/03')
assert out == extractors.extract_date('01/02 2003')

# YMD (when the first cannot parse as a day, default to YMD)
assert out == extractors.extract_date('2003-02-01')
assert extractors.extract_date('98-02-01') == \
    datetime.datetime(1998, 2, 1, 0, 0)

# single digits
assert out == extractors.extract_date('1-2-2003')
assert out == extractors.extract_date('1/2 2003')
assert out == extractors.extract_date('2003-2-1')

# when there are not other possibilities (MDY, YDM)
assert extractors.extract_date('12-31-98') == \
    datetime.datetime(1998, 12, 31, 0, 0)
assert extractors.extract_date('98-31-12') == \
    datetime.datetime(1998, 12, 31, 0, 0)
elgehelge
  • 2,014
  • 1
  • 19
  • 24
0

I tried pandas and I was puzzled when I saw it downloading (11.6MB) and what was my surprise when it started downloading numpy (12.1MB), too.

But as a European, I do not need the default "month-first" behaviour of dateutil, so I am now using this:

import re
sloppy_iso8601 = re.compile('^[12][0-9][0-9][0-9]-[0-9][0-9]?-[0-9][0-9]?.*$')
import dateutil.parser

def parse_date(value, dayfirst=True, yearfirst=False, **kwargs):
    if sloppy_iso8601.match(value) is not None:
        dayfirst = False
        yearfirst = True
    return dateutil.parser.parse(value, dayfirst=dayfirst, yearfirst=yearfirst, **kwargs)

which behaves as the OP (and myself) expect.

>>> parse = parse_date
>>> parse("01-02-03")
datetime.datetime(2003, 2, 1, 0, 0)
>>> parse("2003-02-01")
datetime.datetime(2003, 2, 1, 0, 0)
>>> 
mariotomo
  • 9,438
  • 8
  • 47
  • 66