
I have a few strings like the ones below:

'Thursday;60 days;Monday, days;the last two years;the six months;October 2017;March 2018;three days;Jan. 4;Last year;Dec. 21;'

Expected result: October 2017

'January 7;30;39;24;46;1750;April 2017;April 30;February;'

Expected result: April 2017

'Thursday;a day;another six days;the day;Tuesday;three days;mid-October;Wednesday;'

Expected result: mid-October

I know the strings are completely unstructured, but is there a way to extract the dates from them with Python code?

This is part of an NER model in which I am trying to extract the data entities.

I have tried a few methods, but none came close to the expected result because the strings do not follow a consistent pattern.

Chris
Laster
  • What's the logic? – Michał Turczyn Sep 26 '19 at 05:21
  • Sorry, I am not sure what you are asking. – Laster Sep 26 '19 at 06:05
  • At least, can we extract the string part based on the month name? – Laster Sep 26 '19 at 06:16
  • For the first string, I get `['October 2017', 'March 2018', 'Jan. 4', 'Dec. 21']`. For the second string, I get `['January 7', 'April 2017', 'April 30']`, and no matches for the third one, using [datefinder](https://github.com/akoumjian/datefinder). You probably need to pre-process the text if you want to get something like `mid-October`, but you need to come up with specs first. – Wiktor Stribiżew Sep 26 '19 at 08:07

1 Answer


You may use datefinder with a regex to check for month names in the found date time strings:

import re

import datefinder

strs = ['Thursday;60 days;Monday, days;the last two years;the six months;October 2017;March 2018;three days;Jan. 4;Last year;Dec. 21;',
        'January 7;30;39;24;46;1750;April 2017;April 30;February;',
        'Thursday;a day;another six days;the day;Tuesday;three days;mid-October;Wednesday;']

# Matches full and abbreviated English month names, case-insensitively.
month_rx = re.compile(r'(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)', re.I)
for s in strs:
    # With source=True, find_dates yields (datetime, matched_source_text) tuples.
    raw_dates = list(datefinder.find_dates(s, source=True))
    # Keep only the matched substrings that contain a month name.
    print([src for _, src in raw_dates if month_rx.search(src)])

Output:

['October 2017', 'March 2018', 'Jan. 4', 'Dec. 21']
['January 7', 'April 2017', 'April 30']
[]

Note that mid-October cannot be parsed into a valid datetime, so it is not extracted. To capture such values, you will need a more specific regex, such as re.search(r'\b(?:half|mid)-(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)', text).

The (?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?) pattern matches full and abbreviated English month names.
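
If a single value per string is needed, as in the question's expected results, the two patterns can be combined with a priority rule. The following is only a minimal sketch using the standard `re` module, without datefinder. The function name `pick_date` and the selection rule ("Month YYYY" first, then "mid-Month" or "half-Month") are assumptions inferred from the three examples in the question, not a confirmed specification.

```python
import re

# Full and abbreviated English month names (same alternation as in the answer).
MONTH = (r'(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?'
         r'|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?'
         r'|Oct(?:ober)?|Sep(?:tember)?)')

# Assumed priority order: "Month YYYY" first, then "mid-Month" / "half-Month".
month_year_rx = re.compile(r'\b' + MONTH + r'\.?\s+\d{4}\b', re.I)
mid_month_rx = re.compile(r'\b(?:half|mid)-' + MONTH + r'\b', re.I)


def pick_date(text):
    """Return the first "Month YYYY" item, else the first "mid-Month" item, else None.

    Hypothetical helper: the priority rule is a guess based on the question's examples.
    """
    # The input is a semicolon-separated list; drop empty pieces and surrounding spaces.
    items = [item.strip() for item in text.split(';') if item.strip()]
    for rx in (month_year_rx, mid_month_rx):
        for item in items:
            # fullmatch requires the whole item to be the date expression.
            if rx.fullmatch(item):
                return item
    return None
```

Calling `pick_date(s)` for each string in `strs` returns one item per string, or `None` when neither pattern matches. Items such as 'Jan. 4' or 'April 30' are skipped on purpose, because the sketch assumes a four-digit year is required.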

Wiktor Stribiżew