I am parsing a date out of a PDF document that also contains other date-like strings, and dateutil raises the following error:

Traceback (most recent call last):
  File "/Users/akjain/Documents/workspace/Parse13F/13FParser.py", line 26, in <module>
    print dparser.parse('  Crl. A. Nos. 291/16, 300/16, 581/16 & 1143/16 Judgment reserved on :   May 31, 2017  ', fuzzy=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format

My input is

print dparser.parse('  Crl. A. Nos. 291/16, 300/16, 581/16 & 1143/16 Judgment reserved on :   May 31, 2017  ', fuzzy=True)

and if I remove "291/16, 300/16, 581/16 & 1143/16" from the string, the code runs perfectly.

Can anyone help me parse the date while ignoring the values above?

Martin Thoma
Akhil

3 Answers

It may be that the library is getting confused because it's seeing multiple date-like components in the string. If you know your dates will look like May 31, 2017 and that the false positives will look like 581/16, you can apply a regex to the string to clean it up before doing the fuzzy parsing:

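import re
import dateutil.parser as dparser

string = '  Crl. A. Nos. 291/16, 300/16, 581/16 & 1143/16 Judgment reserved on :   May 31, 2017  '
# Strip number/number tokens such as 291/16 before fuzzy parsing
string = re.sub(r'\d+/\d+', '', string)
print(dparser.parse(string, fuzzy=True))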
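
One caveat with the substitution above: a number/number pattern also matches the first two parts of a numeric date such as 23/12/2017 and would leave "/2017" behind. A minimal sketch of a stricter cleanup, assuming the noise always has exactly two slash-separated parts; the name strip_two_part_numbers is made up for illustration:

```python
import re

# Match number/number only when it is not part of a longer slash-separated
# run, so a three-part date such as 23/12/2017 is left intact.
TWO_PART = re.compile(r'(?<![\d/])\d+/\d+(?![\d/])')

def strip_two_part_numbers(text):
    """Remove tokens like 291/16 but keep tokens like 23/12/2017."""
    return TWO_PART.sub('', text)

cleaned = strip_two_part_numbers('Nos. 291/16, 300/16 & 1143/16 on 23/12/2017, May 31, 2017')
print(cleaned)
```

Longer runs such as 234/23/134 are also left alone by this pattern, so they would still need separate handling.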

If instead you want to define the structure of the dates you are parsing for, you can use regular expressions in a different way:

import re

s = 'test 234/23/134 234 291/16, 300/16, 581/16 & 1143/16 May 31, 2017 10/15/1997'
# Month name, day, four-digit year
match_1 = re.search(r'[A-Za-z]+ [\d]{1,2}, [\d]{4}', s)
print(match_1.group(0))
# => May 31, 2017
# Two digits / two digits / four digits
match_2 = re.search(r'[\d]{2}/[\d]{2}/[\d]{4}', s)
print(match_2.group(0))
# => 10/15/1997

You can even combine the two to extract all the dates that show up in a given line for your expected patterns:

import re

# s is the sample string defined in the previous snippet
pattern_1 = r'[A-Za-z]+ [\d]{1,2}, [\d]{4}'
pattern_2 = r'[\d]{2}/[\d]{2}/[\d]{4}'
matches = re.findall(r'{}|{}'.format(pattern_1, pattern_2), s)
print(matches)
# => ['May 31, 2017', '10/15/1997']
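
A regex match only shows that a substring has the right shape, not that it is a real date (99/99/2017 would match the second pattern). A possible follow-up, sketched with the standard library only, is to pass each match through datetime.strptime and keep the ones that parse. The helper name extract_dates and the month/day/year reading of numeric dates are illustrative assumptions, not something the patterns above guarantee:

```python
import re
from datetime import datetime

# (pattern, strptime format) pairs; the formats are assumptions about the input.
CANDIDATES = [
    (r'[A-Za-z]+ \d{1,2}, \d{4}', '%B %d, %Y'),   # May 31, 2017
    (r'\b\d{2}/\d{2}/\d{4}\b', '%m/%d/%Y'),       # 10/15/1997
]

def extract_dates(text):
    """Return datetime objects for every substring that matches a pattern and parses."""
    found = []
    for pattern, fmt in CANDIDATES:
        for candidate in re.findall(pattern, text):
            try:
                found.append(datetime.strptime(candidate, fmt))
            except ValueError:
                # Right shape but not a real date, e.g. 99/99/2017.
                pass
    return found

s = 'test 234/23/134 234 291/16, 300/16, 581/16 & 1143/16 May 31, 2017 10/15/1997'
print(extract_dates(s))
```

Results are grouped by pattern rather than by position in the text, and %B expects a full English month name under the default locale.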
Daniel Corin
  • Thanks danielcorin, this certainly helped. However, another challenge is that my document also has a lot of other numbers in formats such as 234/23/134 or 234. Filtering these two formats might also filter out a date of the form 23/12/2017. So, is there a way to just define what my date might look like, rather than filtering out the non-date-like data? – Akhil Jul 22 '17 at 07:02
  • Yes you can, with regex as well. See the edits above – Daniel Corin Jul 22 '17 at 17:32
  • This is absolutely what is happening. – Paul Jul 22 '17 at 19:55

Use a try statement with an except clause, for instance:

try:
    print(dparser.parse('...'))
except ValueError as ve:
    print('ValueError: {}'.format(ve))
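
Catching the exception keeps the script from crashing, but it does not by itself recover the date. One way to build on it, sketched here under assumptions: on ValueError, fall back to a regex search for a "Month D, YYYY" substring. parse_with_fallback and strict_parse are hypothetical helpers, and the parser is passed in as a callable so the sketch does not depend on any particular library:

```python
import re
from datetime import datetime

MONTH_DATE = re.compile(r'[A-Za-z]+ \d{1,2}, \d{4}')

def parse_with_fallback(text, parse):
    """Try parse(text); on ValueError, retry on the first 'Month D, YYYY' substring."""
    try:
        return parse(text)
    except ValueError:
        match = MONTH_DATE.search(text)
        if match is None:
            return None
        try:
            return datetime.strptime(match.group(0), '%B %d, %Y')
        except ValueError:
            return None

def strict_parse(text):
    # Stand-in parser for the demo: accepts only a bare 'Month D, YYYY' string.
    return datetime.strptime(text.strip(), '%B %d, %Y')

line = '  Crl. A. Nos. 291/16, 300/16, 581/16 & 1143/16 Judgment reserved on :   May 31, 2017  '
print(parse_with_fallback(line, strict_parse))
```

With dateutil, the second argument could be something like lambda t: dparser.parse(t, fuzzy=True).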
stovfl

Since you know which date format works with that parser, you can use regex-based code to convert other date formats to that format and to remove items that confuse the parser.

In this illustrative example I have added the date '23/12/2017' near the beginning of the string you were working with. The code looks for the pattern given in the re.sub call and passes each match to process. process discards any match that consists of fewer than three parts. Otherwise it attempts to create a date from the numbers in the match. If this succeeds, it formats the date as shown in the output so that the parser should be able to recognise it; if it fails, the match is left unchanged. I've used the arrow library in conjunction with datetime for these date manipulations.

>>> import re
>>> s = 'On 23/12/2017  Crl. A. Nos. 291/16, 300/16, 581/16 & 1143/16 Judgment reserved on :   May 31, 2017  '
>>> from datetime import datetime
>>> import arrow
>>> def process(matchobj):
...     items = matchobj.group(0).split('/')[::-1]
...     items = [int(_) for _ in items]
...     if len(items)<3:
...         return ''
...     try:
...         the_date = arrow.get(datetime(*items))
...         return the_date.format('MMMM DD, YYYY')
...     except Exception:
...         return matchobj.group(0)
... 
>>> re.sub(r'(?:\d+/)+\d+', process, s)
'On December 23, 2017  Crl. A. Nos. , ,  &  Judgment reserved on :   May 31, 2017  '
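
If installing arrow is not an option, roughly the same rewrite can be sketched with datetime.strftime alone. This assumes day/month/year order, as in the example above, and an English locale for %B; normalise is an illustrative name:

```python
import re
from datetime import datetime

def process(matchobj):
    # Split 'dd/mm/yyyy' and reverse it into (year, month, day).
    parts = [int(p) for p in matchobj.group(0).split('/')[::-1]]
    if len(parts) < 3:
        return ''                      # e.g. 291/16: drop it
    try:
        return datetime(*parts).strftime('%B %d, %Y')
    except (ValueError, TypeError, OverflowError):
        return matchobj.group(0)       # not a valid date: leave it unchanged

def normalise(text):
    """Rewrite numeric dates as 'Month DD, YYYY' and drop two-part numbers."""
    return re.sub(r'(?:\d+/)+\d+', process, text)

s = 'On 23/12/2017  Crl. A. Nos. 291/16, 300/16, 581/16 & 1143/16 Judgment reserved on :   May 31, 2017  '
print(repr(normalise(s)))
```

The string still contains two dates after this step, so the caller has to decide which one to hand to the parser.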
Bill Bell