0

I have seen several posts about checking whether a string is a date, but none about checking whether a sentence contains a date somewhere in it.

I have used dateutil's parse function, which seems to recognize a date only when the date is the sole content of the string.

from dateutil.parser import parse

def is_date(string, fuzzy=False):
    """
    Return whether the string can be interpreted as a date.

    :param string: str, string to check for date
    :param fuzzy: bool, ignore unknown tokens in string if True
    """
    try: 
        parse(string, fuzzy=fuzzy)
        return True

    except ValueError:
        return False

>>> is_date("1990-12-1")
True
>>> is_date("foo 1990-12-1 bar")
False
TheHedge

3 Answers

1

One solution is to split the string and then test each part, returning True if any of the split strings successfully parses to a date.

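from dateutil.parser import parse

def is_date(string, fuzzy=False):
    """
    Return whether any whitespace-separated token of the string can be
    interpreted as a date.

    :param string: str, string to check for date
    :param fuzzy: bool, ignore unknown tokens in string if True
    """
    def parses_as_date(token):
        try:
            parse(token, fuzzy=fuzzy)
            return True
        except (ValueError, OverflowError):
            # parse raises ValueError for unparseable input and
            # OverflowError for numbers too large to be a date.
            return False

    return any(parses_as_date(s) for s in string.split())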

>>> is_date("1990-12-1")
True

>>> is_date("foo 1990-12-1 bar")
True

>>> is_date("foo 1990-13-1 bar")
False

>>> is_date('Book by appt. for Dec. 31, 2019')
True  # Both 'Dec.' and '2019' successfully parse to a date.

# But be wary of false positives.
>>> is_date('I had 2019 hits on my website today')
True  
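
If false positives like the one above are a problem, one option is to swap dateutil for datetime.strptime and test each token against an explicit list of formats. This is only a rough sketch: the names KNOWN_FORMATS, token_is_date and has_strict_date are made up for illustration, and the format list is an assumption you would adjust to your own data.

```python
from datetime import datetime

# Assumed formats for illustration; replace with the ones your data uses.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")

def token_is_date(token, formats=KNOWN_FORMATS):
    """Return True if the token matches at least one explicit format."""
    for fmt in formats:
        try:
            datetime.strptime(token, fmt)
            return True
        except ValueError:
            continue
    return False

def has_strict_date(text, formats=KNOWN_FORMATS):
    """Return True if any whitespace-separated token is a full date."""
    # Strip common trailing/leading punctuation so "1990-12-1." still matches.
    return any(token_is_date(tok.strip(".,;:!?()"), formats)
               for tok in text.split())
```

With these formats a bare number such as 2019 is no longer accepted, at the cost of missing dates that span several tokens (for example 'Dec. 31, 2019').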
Alexander
1

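You could use a simple regex pattern:

import re

def is_date(regex, s):
    # re.search finds the pattern anywhere in the string; \b keeps the
    # date from matching inside a longer run of digits or letters.
    return bool(re.search(regex, s))

regex = r'\b\d{4}-\d\d?-\d\d?\b'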

>>> is_date(regex, "foo bar")
False
>>> is_date(regex, "1990-12-1")
True
>>> is_date(regex, "foo 1990-12-1 bar")
True

This will match any date in the format ####-#[#]-#[#], where a # in square brackets is an optional digit. You can modify the regex pattern to suit your needs.

more about regex
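
Note that a pattern like this only checks the shape of the text, so something like 1990-13-1 still matches. A possible follow-up step, sketched here with the standard library only, is to feed the captured numbers to datetime.date and keep only the ones that are real calendar dates. DATE_RE and find_valid_dates are hypothetical names used just for this example.

```python
import re
from datetime import date

# Same shape as the pattern above, with groups for year, month and day.
DATE_RE = re.compile(r'\b(\d{4})-(\d{1,2})-(\d{1,2})\b')

def find_valid_dates(text):
    """Return every ####-#[#]-#[#] match in text that is a real date."""
    found = []
    for match in DATE_RE.finditer(text):
        year, month, day = map(int, match.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            # Right shape, but not a calendar date (e.g. month 13).
            pass
    return found
```

An empty list then means "no date in the sentence", and a non-empty one also tells you which dates were found.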

grizzasd
0

One possibility is to check every contiguous substring of the original string. This performs terribly (on the order of N^2 calls to the OP's is_date for a string of length N), but it does not rely on heuristics for splitting the string into tokens or on regex definitions: by construction, it matches if and only if is_date matches some substring.

def get_all_substrings(input_string):
    # From https://stackoverflow.com/questions/22469997/how-to-get-all-the-contiguous-substrings-of-a-string-in-python
    # Could be made a generator to save space, but this is not meant to be a performant solution anyway.
    length = len(input_string)
    # range replaces Python 2's xrange, which does not exist in Python 3.
    return [input_string[i:j+1] for i in range(length) for j in range(i, length)]

def contains_date(string):
    # is_date is the function from the question.
    for substring in get_all_substrings(string):
        if is_date(substring):
            return True
    return False
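
For what it's worth, the generator variant mentioned in the comment could look roughly like the sketch below. The names iter_substrings and is_iso_date are invented for this example, and is_iso_date is only a standard-library stand-in (assuming a single %Y-%m-%d format) for the question's dateutil-based is_date, which you would pass in instead.

```python
from datetime import datetime

def iter_substrings(text):
    """Lazily yield every contiguous, non-empty substring of text."""
    length = len(text)
    for start in range(length):
        for end in range(start + 1, length + 1):
            yield text[start:end]

def contains_date(text, predicate):
    """Return True as soon as predicate accepts some substring of text."""
    # any() stops at the first hit, so no list of substrings is ever built.
    return any(predicate(sub) for sub in iter_substrings(text))

def is_iso_date(candidate):
    # Hypothetical stand-in for the question's is_date (stdlib only).
    try:
        datetime.strptime(candidate, "%Y-%m-%d")
        return True
    except ValueError:
        return False
```

This saves memory and returns early on a match, but the worst case is still quadratic in the length of the string.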
Leporello