I am trying to extract dates from email texts using the datefinder
Python library.
Below is a code snippet of what I am trying to do.
import datefinder

# body holds a list of email texts
email_dates = []
for b in body:
    dates = datefinder.find_dates(b)
    date = []
    for d in dates:
        date.append(d)
    email_dates.append(date)
datefinder tries to interpret every number in an email as a date, so I get a lot of false positives. I can filter those out with some logic. However, for some emails I get an IllegalMonthError, and I cannot get past the error to retrieve dates from the remaining emails. Below is the error:
---------------------------------------------------------------------------
IllegalMonthError Traceback (most recent call last)
c:\python\python38\lib\site-packages\dateutil\parser\_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
654 try:
--> 655 ret = self._build_naive(res, default)
656 except ValueError as e:
c:\python\python38\lib\site-packages\dateutil\parser\_parser.py in _build_naive(self, res, default)
1237
-> 1238 if cday > monthrange(cyear, cmonth)[1]:
1239 repl['day'] = monthrange(cyear, cmonth)[1]
c:\python\python38\lib\calendar.py in monthrange(year, month)
123 if not 1 <= month <= 12:
--> 124 raise IllegalMonthError(month)
125 day1 = weekday(year, month, 1)
IllegalMonthError: bad month number 42; must be 1-12
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-39-1fbacc8ca3f6> in <module>
7 dates = datefinder.find_dates(b)
8 date = []
----> 9 for d in dates:
10 date.append(d)
11
c:\python\python38\lib\site-packages\datefinder\__init__.py in find_dates(self, text, source, index, strict)
30 ):
31
---> 32 as_dt = self.parse_date_string(date_string, captures)
33 if as_dt is None:
34 ## Dateutil couldn't make heads or tails of it
c:\python\python38\lib\site-packages\datefinder\__init__.py in parse_date_string(self, date_string, captures)
100 # otherwise self._find_and_replace method might corrupt them
101 try:
--> 102 as_dt = parser.parse(date_string, default=self.base_date)
103 except (ValueError, OverflowError):
104 # replace tokens that are problematic for dateutil
c:\python\python38\lib\site-packages\dateutil\parser\_parser.py in parse(timestr, parserinfo, **kwargs)
1372 return parser(parserinfo).parse(timestr, **kwargs)
1373 else:
-> 1374 return DEFAULTPARSER.parse(timestr, **kwargs)
1375
1376
c:\python\python38\lib\site-packages\dateutil\parser\_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
655 ret = self._build_naive(res, default)
656 except ValueError as e:
--> 657 six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
658
659 if not ignoretz:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
For example, if this error occurs in the 5th email, I cannot retrieve dates from the 5th email onwards. How can I bypass this error, drop the entries that cause it, and still retrieve all the other dates?
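To make the desired behaviour concrete, here is a minimal sketch of the kind of workaround I have in mind, not a confirmed fix. It wraps the per-email loop in try/except so that one bad email does not stop the rest. The names `extract_dates` and `failed` are hypothetical, the finder is passed in as a parameter so the snippet does not depend on datefinder being installed, and the caught exception types are an assumption based on the traceback above (the final error is a TypeError, and IllegalMonthError is a ValueError subclass).

```python
def extract_dates(bodies, find_dates):
    """Collect dates per email, skipping over emails whose parsing raises.

    bodies     -- list of email texts
    find_dates -- callable returning an iterable of dates for one text,
                  e.g. datefinder.find_dates (passed in, not imported here)
    """
    email_dates = []  # one list of dates per email, same order as bodies
    failed = []       # indices of emails where parsing raised (hypothetical helper list)

    for i, text in enumerate(bodies):
        found = []
        try:
            # The error surfaces while iterating, not when find_dates is called,
            # so the loop itself has to sit inside the try block.
            for d in find_dates(text):
                found.append(d)
        except (TypeError, ValueError, OverflowError):
            # Assumed exception types, taken from the traceback in this post.
            failed.append(i)
        # Keep whatever was collected before the failure, possibly nothing.
        email_dates.append(found)

    return email_dates, failed
```

It would be called as `email_dates, failed = extract_dates(body, datefinder.find_dates)`. Since the error is raised while iterating, `find_dates` appears to return a generator, and a generator cannot be resumed after it raises. So this keeps only the dates yielded before the failure in that email, and any later dates in the same email are lost. It still does not remove the individual entry that triggers the error.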
Thanks in advance.