Let me first share a text:
I am Fox Sin of Greed came on Earth in 1666 BC. due date right after
St. P was build in 16.05.1703 and bluh bluh I moved to Moscow Feb
2nd, 2022 to work as per deadline And today I read manga Due date for
my project is September 12, 2022 I wonder if Ill be able to pay by Oct
07, 2023 and so The deadline is unknown by I assume would be 9102023
Bluh bluh Due Date 12-11-2022 30/08/2021 and 9/19/23
This is a randomly generated text for testing dateparser and regex. I wrote a function that is pretty good at recognising dates with regex, except for those in the format [month as letters] [day as number], [year as number]. For those I usually use dateparser, since it is capable of recognising them. However, it fails when the text contains 'trigger words' such as 'may', 'to pay' (??) and the like. Example:
I moved to Moscow Feb 2nd, 2022 to work as per deadline
[('to', datetime.datetime(2022, 9, 8, 0, 0)), ('Feb 2nd, 2022 to', datetime.datetime(2022, 2, 2, 0, 0))]
This is good. It recognised 'Feb 2nd, 2022', even though it appended 'to' to the match.
But the next one:
I wonder if Ill be able to pay by Oct 07, 2023 and so
[('to pay', datetime.datetime(2022, 9, 8, 0, 0)), ('07, 2023', datetime.datetime(2023, 7, 8, 0, 0))]
It failed to connect 'Oct' to '07, 2023', so '07' was read as the month and the result came out as July 2023 instead of 7 October 2023.
This is used for extracting data from invoices, and I have no control over the formats the dates come in, so I was wondering whether more experienced dateparser users (or users of other Python tools) can help me avoid this problem. Right now it seems to me that I need to avoid words such as 'may', 'to pay', 'now', etc.
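One direction that might work, sketched here with only the standard library: a regex pre-pass that pulls out the [month as letters] [day], [year] form before any free-text parsing, so words like 'to pay' never reach the parser. The names `MONTH_DAY_YEAR` and `find_month_name_dates` are hypothetical, and the pattern assumes English month names and four-digit years; it is not a replacement for dateparser on other formats.

```python
import re
from datetime import datetime

# Assumption: English month names only, keyed by their first three letters.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Matches "<month name> <day>[st|nd|rd|th][,] <4-digit year>", e.g. "Feb 2nd, 2022".
# A bare "may" or "to pay" cannot match because a day and a year must follow.
MONTH_DAY_YEAR = re.compile(
    r"\b(?P<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\.?\s+"
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)?,?\s+"
    r"(?P<year>\d{4})\b",
    re.IGNORECASE,
)


def find_month_name_dates(text):
    """Return (matched_text, datetime) pairs for month-name dates found in text."""
    results = []
    for m in MONTH_DAY_YEAR.finditer(text):
        month = MONTHS[m.group("month")[:3].lower()]
        try:
            parsed = datetime(int(m.group("year")), month, int(m.group("day")))
        except ValueError:
            # Skip impossible dates such as "Feb 31, 2022".
            continue
        results.append((m.group(0), parsed))
    return results


print(find_month_name_dates("I wonder if Ill be able to pay by Oct 07, 2023 and so"))
print(find_month_name_dates("I moved to Moscow Feb 2nd, 2022 to work as per deadline"))
```

On the two example sentences this picks up only 'Oct 07, 2023' and 'Feb 2nd, 2022', without the stray 'to' or 'to pay' entries. The matched spans could then be removed from the text before handing the remainder to the existing regex function or to dateparser.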