3

I want a class something like this:

public interface IDateRecognizer
{
    DateTime[] Recognize(string s);
}

The dates might appear anywhere in the string and might be in any format. For now, I could limit this to U.S. culture formats. The dates would not be delimited in any way, and they might have arbitrary amounts of whitespace between the parts of the date. The ideas I have are:

  • ANTLR
  • Regex
  • Hand rolled

I have never used ANTLR, so I would be learning from scratch. I wonder if there are libraries or code samples out there that do something similar that could jump start me. Is ANTLR too heavy for such a narrow use?

I have used Regex a lot before, but I hate it for all the reasons that most people hate it.

I could certainly hand roll it but I'd rather not re-solve a solved problem.

Suggestions?

UPDATE: Here is an example. Given this input:

This is a date 11/3/63. Here is another one: November 03, 1963; and another one Nov 03, 63 and some more (11/03/1963). The dates could be in any U.S. format. They might have dashes like 11-2-1963 or weird extra whitespace inside like this: Nov   3,   1963, and even maybe the comma is missing like [Nov 3 63] but that's an edge case.

The output should be an array of seven DateTimes. Each date would be the same: 11/03/1963 00:00:00.

UPDATE: I hand rolled this completely, and I am happy with the result. Instead of using Regex, I ended up using DateTime.TryParse with a custom DateTimeFormatInfo, which makes it very easy to fine-tune which formats are allowed and how two-digit years are handled. Performance is quite acceptable given that this runs asynchronously. The tricky part was tokenizing the input and testing sets of adjacent tokens efficiently.
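The following is a minimal sketch of the tokenize-and-test idea, not the actual implementation described above. It makes several assumptions: the class name DateScanner, the format list, the punctuation set, and the three-token window are all hypothetical, and it uses DateTime.TryParseExact with an explicit format list rather than DateTime.TryParse with a customized DateTimeFormatInfo. Two-digit years are left to the culture's default pivot.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class DateScanner
{
    // Hypothetical format list: extend or trim it to control what counts as a date.
    private static readonly string[] Formats =
    {
        "M/d/yyyy", "M/d/yy", "M-d-yyyy", "M-d-yy",
        "MMMM d yyyy", "MMMM d yy", "MMM d yyyy", "MMM d yy"
    };

    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };
    private static readonly char[] Punctuation = { '.', ',', ';', ':', '(', ')', '[', ']' };
    private static readonly CultureInfo UsCulture = CultureInfo.GetCultureInfo("en-US");

    // Assumption: no supported date spans more than three tokens (e.g. "Nov 3 1963").
    private const int MaxWindow = 3;

    public static DateTime[] Recognize(string s)
    {
        // Split on whitespace, then strip surrounding punctuation from each token.
        string[] tokens = s.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
                           .Select(t => t.Trim(Punctuation))
                           .Where(t => t.Length > 0)
                           .ToArray();

        var found = new List<DateTime>();
        int i = 0;
        while (i < tokens.Length)
        {
            int consumed = 0;
            // Try the longest window of adjacent tokens first, then shrink it.
            for (int len = Math.Min(MaxWindow, tokens.Length - i); len >= 1; len--)
            {
                string candidate = string.Join(" ", tokens, i, len);
                DateTime date;
                if (DateTime.TryParseExact(candidate, Formats, UsCulture,
                                           DateTimeStyles.None, out date))
                {
                    found.Add(date);
                    consumed = len;
                    break;
                }
            }
            // Skip past a matched date; otherwise advance by one token.
            i += Math.Max(consumed, 1);
        }
        return found.ToArray();
    }
}
```

Joining the tokens with a single space normalizes the "weird extra whitespace" case, and trimming punctuation handles the optional comma. One trade-off of this sketch is that stripping punctuation also discards sentence boundaries, so tokens on either side of a period can be combined into one candidate.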

doppelgreener
Tim Scott

3 Answers

4

I'd go for a hand-rolled solution that chops the input string into manageable pieces and then lets a few regexes do the work. This also seems like a great case for starting with unit tests.

CodingBarfield
1

I'd suggest going with regex. I'd put each regex (matching one date format) into its own string, collect those strings in an array, and then build the full regex at runtime. This makes the system more flexible. Depending on what you need, you could consider storing the individual date regexes in an (XML) file or a database.
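A minimal sketch of that approach, assuming three illustrative patterns: the class name RegexDateFinder, the method FindCandidates, and the patterns themselves are hypothetical, and the patterns are hard-coded here where they could equally be loaded from a file. It only locates candidate substrings; each one would still need to be parsed and validated (for example with DateTime.TryParse).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexDateFinder
{
    // One pattern per date format (hypothetical examples for U.S. formats).
    private static readonly string[] DatePatterns =
    {
        // 11/3/63 or 11/03/1963
        @"\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})",
        // 11-2-1963
        @"\d{1,2}-\d{1,2}-(?:\d{4}|\d{2})",
        // Nov 3 63, Nov   3,   1963, November 03, 1963
        @"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{1,2},?\s+(?:\d{4}|\d{2})"
    };

    // The full regex is assembled at runtime by joining the patterns with "|".
    private static readonly Regex Combined = new Regex(
        @"\b(?:" + string.Join("|", DatePatterns.Select(p => "(?:" + p + ")")) + @")\b");

    // Returns the raw substrings that look like dates, in order of appearance.
    public static string[] FindCandidates(string s)
    {
        return Combined.Matches(s).Cast<Match>().Select(m => m.Value).ToArray();
    }
}
```

Wrapping each pattern in its own non-capturing group keeps the alternatives independent, so adding a format means adding one array entry without touching the others.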

manubaum
0

Recognising dates seems to be a straightforward and easy task for Regex. I cannot understand why you are trying to avoid it.

ANTLR is overkill for a case like this, where you have a very limited set of semantics.

Performance could be an issue, but I really doubt that the other options would give you better performance.

So I would go with Regex.

Aliostad
  • Any suggestions how to get started? I would like to be able to handle a string like: "This is a date 11/3/09 and another one Sept 18, 2010 and another one September 02, 99 and more dates 01/01/1966 in any U.S. format Jan 33, 2010 with weird extra whitespace inside, and even maybe if the comma is missing like Oct 3 99". So I'd like that to return DateTime[] with 6 dates. – Tim Scott Mar 07 '11 at 14:08
  • Update your question with a list of the dates you have in mind and we will suggest Regex patterns. Obviously, every variant you need to handle has to be defined before it can be added to the regex. – Aliostad Mar 07 '11 at 14:11