I'm using datefinder, which extracts possible date/time strings from a piece of text, then puts those strings through dateutil.parser.parse() to yield datetimes.
In a key use case, I have multiple resulting datetimes which probably correspond to the same date. I'd like to post-process them (how is not the subject here) to improve the accuracy of the parsed dates.
To do that, I need to get an indication of confidence in the parsed values.
In a perfect world I'd be able to do:
dt, confidence = dateutil.parser.parse_with_confidence('12/12/2017')
print(confidence)
>> {'year': 1, 'month': 0.5, 'day': 0.5, 'hours': 0, 'minutes': 0, 'seconds': 0, 'tz': 0}
dt, confidence = dateutil.parser.parse_with_confidence('12/Dec/2017')
print(confidence)
>> {'year': 1, 'month': 1, 'day': 1, 'hours': 0, 'minutes': 0, 'seconds': 0, 'tz': 0}
That would allow me to choose the second one, since it has greater confidence in its month and day values.
Partial workaround:
I can call dateutil twice, with a totally different default date...
from datetime import datetime
import dateutil.parser
dt_a = dateutil.parser.parse('Dec2017', default=datetime(2017, 1, 1, 0, 0, 0))
print(dt_a)
>> 2017-12-01 00:00:00
dt_b = dateutil.parser.parse('Dec2017', default=datetime(2018, 2, 2, 1, 1, 1))
print(dt_b)
>> 2017-12-02 01:01:01
...then evaluate which values were assigned by the parser and which defaulted.
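A minimal sketch of that evaluation step, assuming the two defaults differ in every field so that any field which matches across both parses must have come from the string. The helper name parse_with_provenance and the constants are hypothetical names of my own, not part of dateutil, and time zones are not covered:

```python
from datetime import datetime
from dateutil import parser

# Two sentinel defaults that differ in every field (hypothetical values).
DEFAULT_A = datetime(2017, 1, 1, 0, 0, 0)
DEFAULT_B = datetime(2018, 2, 2, 1, 1, 1)
FIELDS = ("year", "month", "day", "hour", "minute", "second")


def parse_with_provenance(text):
    """Parse twice and report which fields were taken from the string.

    A field that is equal in both results cannot have come from the
    defaults, because the defaults disagree on every field.
    """
    dt_a = parser.parse(text, default=DEFAULT_A)
    dt_b = parser.parse(text, default=DEFAULT_B)
    explicit = {f: getattr(dt_a, f) == getattr(dt_b, f) for f in FIELDS}
    return dt_a, explicit


dt, explicit = parse_with_provenance("Dec2017")
print(dt, explicit)
```

This only says whether a field was present in the string, not how certain the parser was about which number it assigned to that field.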
This goes some way toward achieving what I need, as it lets me choose between dates generated from (say) the strings 'dec2017' and '11dec2017' (where one contains more information than the other).
But parsing twice is crude, and the result can be actively misleading: in the case above, the latter string has more information but also contains ambiguity.
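One rough way I can get at the ambiguity part, at least for day/month ordering, is to parse with both settings of dateutil's dayfirst flag and see which fields change. This is only a sketch: ambiguous_fields is a hypothetical helper of my own, and it only flags fields whose value actually differs between the two readings, so a string like '12/12/2017' would still look unambiguous:

```python
from dateutil import parser

FIELDS = ("year", "month", "day")


def ambiguous_fields(text):
    """Return the date fields whose value depends on the dayfirst setting."""
    month_first = parser.parse(text, dayfirst=False)
    day_first = parser.parse(text, dayfirst=True)
    return {f for f in FIELDS if getattr(month_first, f) != getattr(day_first, f)}


print(ambiguous_fields("11/12/2017"))   # month and day depend on the reading
print(ambiguous_fields("12/Dec/2017"))  # named month leaves nothing to swap
```

That is now four parses per string once combined with the default trick, which is exactly the kind of crudeness I'd like to avoid.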
Is there a module out there that does this? Or am I facing modifying dateutil?