2

I need to extract all the dates from a PDF and then determine which of them is the contract date.

As a first step, I want to extract every date from the text I have pulled out of the PDF. The dates can appear in various formats; I have tried to include every variant in the example below.

I tried the datefinder Python module to extract the dates. It comes close, but it returns a few garbage dates at the start and doesn't match the first date correctly.

import datefinder

dateContent = """ Test
I want to apply for leaves August,​ ​11,​ ​2017 I want to apply for leaves Aug, 23, 2017 I want to apply for leaves Aug, 21, 17 
I want to apply for leaves August 20 2017
I want to apply for leaves August 30th, 2017 I want to apply for leaves August 31st 17
I want to apply for leaves 8/26/2017 I want to apply for leaves 8/27/17
I want to apply for leaves 28/8/2017 I want to apply for leaves 29/8/17 I want to apply for leaves 30/08/17
I want to apply for leaves 15 Jan 17 I want to apply for leaves 14 January 17
I want to apply for leaves 13 Jan 2017
I want to apply for leaves Jan 10 17 I want to apply for leaves Jan 11 2017 I want to apply for leaves January 12 2017
"""

matches = datefinder.find_dates(dateContent)

for match in matches:
    print(match)

Output:

2019-08-05 00:00:00
2019-06-11 00:00:00
2017-06-05 00:00:00
2017-08-23 00:00:00
2017-08-21 00:00:00
2017-08-20 00:00:00
2017-08-30 00:00:00
2017-08-31 00:00:00
2017-08-26 00:00:00
2017-08-27 00:00:00
2017-08-28 00:00:00
2017-08-29 00:00:00
2017-08-30 00:00:00
2017-01-15 00:00:00
2017-01-14 00:00:00
2017-01-13 00:00:00
2017-01-10 00:00:00
2017-01-11 00:00:00
2017-01-12 00:00:00

As you can see, the text contains 17 dates, but I am getting 19 results. Counting from the bottom, the last 16 match correctly; the three before them are garbage. Once I can extract these dates correctly, I can move on to some kind of n-gram model to decide which date's context refers to contractual information.

Any help in resolving the issue would be great.

  • Why is there a downvote? I have clearly stated my requirements with sample code, and I have explicitly listed all the date formats I am targeting. – viki tripathi Jun 05 '19 at 11:52

2 Answers

2

I resolved the issue. There was an encoding problem in my text content.

dateContent = dateContent.replace(u'\u200b', '')

Replacing \u200b (the zero-width space) with an empty string fixed the issue. The datefinder module does the rest of the work of finding all the different date formats.
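
Text extracted from PDFs can carry other invisible characters besides \u200b, so a slightly more general cleanup may be worth having. The following is only a sketch: the helper name strip_invisible and the exact set of characters it removes are assumptions for illustration, and only \u200b was confirmed as the culprit here.

```python
import re

# Invisible characters that sometimes survive text extraction.
# Assumption: this set is illustrative; only U+200B was seen in this case.
_INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def strip_invisible(text: str) -> str:
    """Remove zero-width characters so date tokens are separated by plain spaces."""
    return _INVISIBLE.sub("", text)


# Same character layout as the first date in the question.
raw = "I want to apply for leaves August,\u200b \u200b11,\u200b \u200b2017"
cleaned = strip_invisible(raw)

print(repr(raw))      # the escapes make the hidden characters visible
print(repr(cleaned))
```

Printing repr() of the text is a quick way to spot such characters in the first place. The cleaned string can then be passed to datefinder.find_dates as in the question.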

0

This is a corpus research problem. You have to check your data for variations in date-time strings and work out your own customized regular expression for them. If you are working with natural-language text, rather than system-generated text with distinct patterns for expressing dates, you will never get 100 percent recall and precision. It is always a tradeoff.
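
As a rough illustration of what such a customized expression could look like for the formats listed in the question, here is a minimal sketch. The names MONTH, DAY, YEAR, SEP, DATE_RE and find_date_strings are made up for this example, the pattern only locates candidate strings (it does not parse them), and it assumes invisible characters have already been stripped from the text.

```python
import re

# Building blocks, assumed to cover only the formats shown in the question.
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?"
         r"|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?"
         r"|Nov(?:ember)?|Dec(?:ember)?)")
DAY = r"\d{1,2}(?:st|nd|rd|th)?"   # 11, 30th, 31st
YEAR = r"(?:\d{4}|\d{2})"          # 2017 or 17
SEP = r"[,\s]+"                    # commas and/or whitespace

DATE_RE = re.compile(
    rf"\b(?:{MONTH}{SEP}{DAY}{SEP}{YEAR}"   # August 30th, 2017 / Jan 10 17
    rf"|{DAY}{SEP}{MONTH}{SEP}{YEAR}"       # 15 Jan 17 / 13 Jan 2017
    rf"|\d{{1,2}}/\d{{1,2}}/{YEAR})\b",     # 8/26/2017 / 30/08/17
    re.IGNORECASE,
)


def find_date_strings(text: str) -> list[str]:
    """Return the raw substrings that look like dates, in order of appearance."""
    return [m.group(0) for m in DATE_RE.finditer(text)]


sample = ("leaves August, 11, 2017 and Aug, 21, 17 then August 30th, 2017 "
          "or 8/27/17 or 30/08/17 or 14 January 17 or Jan 10 17")
for found in find_date_strings(sample):
    print(found)
```

The pattern cannot tell whether a string like 8/8/17 is day-first or month-first, so that decision still has to be made when the matched strings are converted to dates.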

CLpragmatics
  • Agreed, but all the possible date formats are listed in the question. There won't be any other combinations. Yes, I am looking into writing my own regex if this module doesn't work out. – viki tripathi Jun 05 '19 at 11:58