
Let me first share a text:

I am Fox Sin of Greed came on Earth in 1666 BC. due date   right after
St. P was build in 16.05.1703 and bluh bluh  I moved to Moscow Feb
2nd, 2022 to work as per deadline  And today I read manga Due date for
my project is September 12, 2022 I wonder if Ill be able to pay by Oct
07, 2023 and so  The deadline is unknown by I assume would be 9102023
Bluh bluh Due Date 12-11-2022 30/08/2021 and 9/19/23

This is randomly generated text for testing dateparser and regex. I wrote a function that is pretty good at recognising dates with regex, except for those in the format [month as letters] [day as number], [year as number]. That is where I usually use dateparser, as it is capable of recognising them. However, when there are 'trigger words' such as 'may', 'to pay' (??) and the like, it fails. Example:

I moved to Moscow Feb 2nd, 2022 to work as per deadline

 [('to', datetime.datetime(2022, 9, 8, 0, 0)), ('Feb 2nd, 2022 to', datetime.datetime(2022, 2, 2, 0, 0))]

This is good. It recognised 'Feb 2nd, 2022', even though it appended 'to' to it.

But next one:

I wonder if Ill be able to pay by Oct 07, 2023 and so

[('to pay', datetime.datetime(2022, 9, 8, 0, 0)), ('07, 2023', datetime.datetime(2023, 7, 8, 0, 0))]

It failed to connect 'Oct' to '07, 2023'.

This is used for extracting data from invoices, and I have no control over which formats the dates come in, so I was wondering whether more experienced dateparser users (or users of other Python tools) can help me avoid this problem. Right now it seems to me that I need to avoid words such as 'may', 'to pay', 'now', etc.

Wiktor Stribiżew
  • Can't you use regex to detect dates? This one is for the format you mentioned (%b %d, %Y): `[A-Za-z]{3}\s[0-3][0-9](st|nd|rd|th)?,\s2[0-1][0-9]{2}`. You can fine-tune it, because it accepts `34` as a day. You can play around with this regexp here: https://regex101.com/r/heWu7y/1 – Mr.TK Sep 08 '22 at 09:47
  • @Mr.TK thanks. I was trying to keep regex to a minimum because it already has so many date patterns, but I guess it's unavoidable. It is just that some use ',' and some don't. This whole invoice automation is a pain because everyone has their own format for dates, and some don't even follow the same one throughout the SAME pdf –  Sep 08 '22 at 10:11
  • btw, I did a similar thing in the past, parsing invoices with a self-learning OCR: it detected keywords and learned what they look like on the invoices. Not every invoice was parsed perfectly, which is why there was a mechanism that let the user mark what is what on the image of the invoice. – Mr.TK Sep 09 '22 at 06:56
  • @Mr.TK what OCR did you use? I tried doing a similar thing with NLP (spaCy) but the results were... meh –  Sep 09 '22 at 08:35
  • It was Adobe's OCR. Unfortunately it was a paid feature back then and probably still is. – Mr.TK Sep 12 '22 at 05:19

1 Answer


If you know the language of the target text, you can specify it, which should prevent problems caused by a wrong language guess. After specifying the language `en`, I get one date, as expected:

from dateparser.search import search_dates
print(search_dates('I wonder if Ill be able to pay by Oct 07, 2023 and so',languages=['en']))

gives output

[('by Oct 07, 2023 and', datetime.datetime(2023, 10, 7, 0, 0))]

Nonetheless, the docs warn that

Warning Support for searching dates is really limited and needs a lot of improvement

so you should be prepared for results that are still not what you want.
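Given that limitation, one option (along the lines of the regex suggestion in the comments) is to pull out the [month as letters] [day], [year] dates with a pattern of your own before handing anything to dateparser. The following is only a minimal sketch using the standard library: the names `MONTH_DAY_YEAR` and `find_month_day_year` are made up for illustration, it assumes English month names, and it treats the comma and the ordinal suffix as optional.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of dateparser. Assumes English month names.
MONTHS = (
    r"Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|"
    r"Nov(?:ember)?|Dec(?:ember)?"
)
MONTH_ABBRS = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]

# Matches e.g. "Feb 2nd, 2022", "Oct 07, 2023", "September 12 2022".
MONTH_DAY_YEAR = re.compile(
    rf"\b(?P<month>{MONTHS})\.?\s+"
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)?,?\s+"
    r"(?P<year>\d{4})\b",
    re.IGNORECASE,
)

def find_month_day_year(text):
    """Return (matched_text, datetime) pairs for month-name dates in text."""
    results = []
    for m in MONTH_DAY_YEAR.finditer(text):
        # Map the first three letters of the month name to its number (1-12).
        month = MONTH_ABBRS.index(m["month"][:3].lower()) + 1
        try:
            parsed = datetime(int(m["year"]), month, int(m["day"]))
        except ValueError:
            continue  # impossible calendar date such as "Feb 34, 2022"
        results.append((m.group(0), parsed))
    return results

print(find_month_day_year("I wonder if Ill be able to pay by Oct 07, 2023 and so"))
```

Because the month name is part of the match, words like 'to pay' are never considered, and an invalid day such as 34 is rejected by `datetime` rather than by the pattern. Purely numeric formats like 12-11-2022 are out of scope here and would still need their own patterns.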

Daweo
  • Thank you! I guess I'll use a combination of language detection and this code! And I shouldn't put all my hopes in dateparser. –  Sep 08 '22 at 10:13