I'm using datefinder, which extracts possible date / time strings from a piece of text, then puts those strings through dateutil.parser.parse() to yield datetimes.

In a key use case, I have multiple resulting datetimes which probably correspond to the same date. I'd like to post-process them (how is not the subject here) to improve the accuracy of the parsed dates.

To do that, I need to get an indication of confidence in the parsed values.

In a perfect world I'd be able to do:

dt, confidence = dateutil.parser.parse_with_confidence('12/12/2017')
print(confidence)
>> {'year': 1, 'month': 0.5, 'day': 0.5, 'hours': 0, 'minutes': 0, 'seconds': 0, 'tz': 0}

dt, confidence = dateutil.parser.parse_with_confidence('12/Dec/2017')
print(confidence)
>> {'year': 1, 'month': 1, 'day': 1, 'hours': 0, 'minutes': 0, 'seconds': 0, 'tz': 0}

Which would allow me to choose the second one, having a greater confidence in its month and day values.

Partial workaround: I can call dateutil twice, with a totally different default date...

from datetime import datetime
import dateutil.parser
dt_a = dateutil.parser.parse('Dec2017', default=datetime(2017, 1, 1, 0, 0, 0))
>> 2017-12-01 00:00:00
dt_b = dateutil.parser.parse('Dec2017', default=datetime(2018, 2, 2, 1, 1, 1))
>> 2017-12-02 01:01:01

...then evaluate which values were assigned by the parser and which defaulted.
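
A minimal sketch of that comparison, assuming two defaults that differ in every field so that any field which agrees across both parses must have come from the string itself. The names `parsed_fields`, `DEFAULT_A`, `DEFAULT_B` and `FIELDS` are hypothetical, not part of dateutil, and the timezone is not covered:

```python
from datetime import datetime
from dateutil import parser

FIELDS = ("year", "month", "day", "hour", "minute", "second")

# Hypothetical defaults chosen to differ in every field.
DEFAULT_A = datetime(2001, 1, 1, 0, 0, 0)
DEFAULT_B = datetime(2002, 2, 2, 1, 1, 1)


def parsed_fields(text):
    """Return {field: True} where the field agrees across both parses,
    i.e. it was taken from the string rather than from the default."""
    a = parser.parse(text, default=DEFAULT_A)
    b = parser.parse(text, default=DEFAULT_B)
    return {f: getattr(a, f) == getattr(b, f) for f in FIELDS}


print(parsed_fields("Dec2017"))
```

This only tells me whether a field was supplied, not whether it was read unambiguously.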

This goes some way towards achieving what I need, as it lets me choose between dates generated from (say) the strings 'dec2017' and '11dec2017' (where one contains more information than the other).

But parsing twice is crude, and the result is actively misleading: in the above case the latter has more information, yet it still contains ambiguity.
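
For the day/month kind of ambiguity specifically, one rough check is to parse with dateutil's `dayfirst` flag set both ways and see whether the results differ. This is only a sketch under that assumption: `day_month_ambiguous` and `FIXED_DEFAULT` are hypothetical names, it yields a yes/no answer rather than a fractional confidence, and it ignores year-position ambiguity:

```python
from datetime import datetime
from dateutil import parser

# Fixed default so both parses fill missing fields identically.
FIXED_DEFAULT = datetime(2000, 1, 1)


def day_month_ambiguous(text):
    """True if swapping the day/month interpretation changes the result."""
    month_first = parser.parse(text, dayfirst=False, default=FIXED_DEFAULT)
    day_first = parser.parse(text, dayfirst=True, default=FIXED_DEFAULT)
    return month_first != day_first


for s in ("11/12/2017", "12/Dec/2017", "13/12/2017"):
    print(s, day_month_ambiguous(s))
```

Note that a string like '12/12/2017' is not flagged, since both readings give the same date even though the parser could not tell the fields apart.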

Is there a module out there to do this? Or am I facing modifying dateutil?

thclark
  • There is no such notion of confidence, and this is a harder problem than you might think – Paul Jan 11 '18 at 19:52
  • Pretty sure those `default` arguments are wrong, too. – Paul Jan 11 '18 at 19:54
  • Fixed the defaults. @Paul there must be a notion of confidence - if I parse '11/12/2017' then I'm approx. 50% confident that the result is correct. Perhaps you could elaborate on some of the pitfalls you envisage? – thclark Jan 12 '18 at 13:17
  • No intermediate representation exists in the parser that keeps track of these things, and doing so would be somewhat difficult the way it is currently written. It's not like the tokens are parsed out and then assigned a probability. – Paul Jan 13 '18 at 01:51
  • Your notion of confidence is interesting but somewhat incomplete. There's no real way to *use* the fractional confidences because you don't know what the alternative ways to parse those tokens are, or even what the tokens were to start with. – Paul Jan 13 '18 at 01:53
