
I am using the Python wrapper for Stanford NLP's SUTime. So far, compared with other date parsers such as duckling, dateparser's search_dates, parsedatetime and natty, SUTime gives the most reliable results.

However, it fails to capture some obvious dates in documents. The following are the two types of documents for which I am having difficulty parsing dates with SUTime.

  1. I am out and I won't be available until 9/19
  2. I am out and I won't be available between (September 18-September 20)

It gives no results for the first document. For the second document, it captures only the month, not the day or the date range.

I tried wrapping my head around the Java code to see if I could alter or add some rules to make this work, but couldn't figure it out.

If someone can suggest a way to make this work with SUTime, it would be really helpful.

I also tried dateparser's search_dates, but it is unreliable because it captures anything and everything. For the first document, for example, it parses a date from the text "am out" (which is not wanted) as well as from "9/19" (which is fine). So a way to control this behavior would work for me as well.

Afsan Abdulali Gujarati

1 Answer


Question: Unable to capture certain date formats

This solution uses datetime instead of SUTime.

import re
from datetime import datetime

def datetime_from_string(datestring):
    # Each rule: (regex, strptime format, values that override the parsed date)
    rules = [(r'(\d{1,2}/\d{1,2})', '%m/%d', {'year': 2018}),
             (r'(\w+ \d{1,2})-(\w+ \d{1,2})', '%B %d', {'year': 2018})]
    for pattern, fmt, overrides in rules:
        # re.match anchors at the start, so datestring must begin with the date
        match = re.match(pattern, datestring)
        if match:
            result = []
            for part in match.groups():
                try:
                    date = datetime.strptime(part, fmt)
                    if 'year' in overrides:
                        # strptime defaults to 1900 when the format has no year
                        date = datetime(overrides['year'], date.month, date.day)
                    result.append(date)
                except ValueError:
                    pass
            return result

    # If you reach here, no rule matched
    raise ValueError("Datestring '{}', does not match any rule!".format(datestring))

# Usage

for datestring in ['9/19', 'September 18-September 20', '2018-09-01']:
    result = datetime_from_string(datestring)
    print("str:{} result:{}".format(datestring, result))

Output:

str:9/19 result:[datetime.datetime(2018, 9, 19, 0, 0)]
str:September 18-September 20 result:[datetime.datetime(2018, 9, 18, 0, 0), datetime.datetime(2018, 9, 20, 0, 0)]
ValueError: Datestring '2018-09-01', does not match any rule!

Tested with Python: 3.4.2
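
Note that `re.match` only matches at the start of the string, so it expects the date expression on its own rather than inside a sentence. A minimal sketch, assuming the same two rules should be applied to whole sentences like the ones in the question, could scan with `re.search` instead. The name `find_dates`, the `RULES` list and the default year 2018 are assumptions made for illustration, not part of any library:

```python
import re
from datetime import datetime

# (regex, strptime format) pairs; the range rule comes first so that
# "September 18-September 20" yields both ends of the range
RULES = [
    (r'([A-Za-z]+ \d{1,2})\s*-\s*([A-Za-z]+ \d{1,2})', '%B %d'),
    (r'\b(\d{1,2}/\d{1,2})\b', '%m/%d'),
]

def find_dates(text, year=2018):
    """Return the dates from the first rule that matches anywhere in text, or []."""
    for pattern, fmt in RULES:
        match = re.search(pattern, text)
        if not match:
            continue
        try:
            # strptime defaults to 1900 when the format has no year, so set it
            return [datetime.strptime(part, fmt).replace(year=year)
                    for part in match.groups()]
        except ValueError:
            continue  # looked like a date but did not parse; try the next rule
    return []

for sentence in ["I am out and I won't be available until 9/19",
                 "I am out and I won't be available between (September 18-September 20)"]:
    print(find_dates(sentence))
```

This still covers only the two formats from the question, and `%B` depends on the locale providing English month names.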

stovfl
  • Thank you for the response, but the two sentences I gave are just a couple of scenarios that SUTime fails to capture. A rule-based system like the one in your answer would only capture those two date formats, unless you are suggesting using it on top of SUTime itself. – Afsan Abdulali Gujarati Oct 07 '18 at 02:07
  • Yes, as a fallback if `SUTime` fails. `SUTime` is also **rule-based** and doesn't handle the keywords `until` and `between`. A simple workaround is `replace('between', 'from')`. As `SUTime` is written in `Java`, you would have to extend it in the `Java` source. I will delete this answer. – stovfl Oct 07 '18 at 08:57
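
The fallback idea from this comment could be sketched roughly as follows. Here `primary_parse` stands in for whatever callable wraps SUTime and is assumed to return a list that is empty when nothing was found. The names `fallback_parse` and `parse_with_fallback` are hypothetical, and whether rewriting `between` to `from` actually helps SUTime is not verified here:

```python
import re

def fallback_parse(text):
    """Hypothetical fallback: return raw 'M/D' strings found in the text."""
    return re.findall(r'\b\d{1,2}/\d{1,2}\b', text)

def parse_with_fallback(text, primary_parse, fallback=fallback_parse):
    """Try the primary parser on a lightly rewritten text, then the fallback."""
    # Workaround suggested in the comment: rewrite 'between' to 'from'
    rewritten = re.sub(r'\bbetween\b', 'from', text)
    results = primary_parse(rewritten)
    if results:
        return results
    # The primary parser found nothing: apply the regex rules to the original text
    return fallback(text)
```

Passing the parser in as a callable keeps the sketch independent of how the SUTime wrapper is set up, so the fallback logic can be tried with a stub first.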