I would recommend generalizing them first (basically, replace the numbers and names with placeholders), then grouping them by format so you have a sample set to work with.
For example, 20th, 21st August 1987 then becomes
[number][postfix], [number][postfix] [month] [year]
(given that <number><st|nd|rd|th> is recognized as a number and postfix, month names are matched by name, and years are 4-digit numbers).
From there, count how many entries follow each pattern, which also tells you how many unique patterns you need to match. Then you at least have a sample to test whatever algorithm you want to throw at it. Regex is probably going to be your best bet, since it can detect repeated patterns (#th[, #th[, ...]]) and day names.
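The generalization step could be sketched roughly like this. It is only a minimal sketch: the generalize() helper, its placeholder names, and the last sample string are made up for illustration, and it assumes the inputs only contain ordinal days, English month names, and 4-digit years.

```php
<?php
// Hypothetical helper: swap concrete values for placeholders.
function generalize(string $text): string
{
    $months = 'january|february|march|april|may|june|july|august|september|october|november|december';
    // Years go first so a 4-digit run is never mistaken for a day number.
    $text = preg_replace('/\b[0-9]{4}\b/', '[year]', $text);
    $text = preg_replace('/\b[0-9]{1,2}(?:st|nd|rd|th)\b/i', '[number][postfix]', $text);
    $text = preg_replace('/\b(?:' . $months . ')\b/i', '[month]', $text);
    return $text;
}

$samples = array(
    '20th, 21st August 1897',
    '29th January 2007',
    '10th, 11th, 12th May 1954',
    '31st May, 1st June 1909',
    '5th, 6th March 1923', // made-up entry so that one pattern repeats
);

// Count how many samples share each generalized pattern, most common first.
$patterns = array_count_values(array_map('generalize', $samples));
arsort($patterns);

foreach ($patterns as $pattern => $count) {
    echo "{$count} x {$pattern}\n";
}
```

The most common patterns end up at the top of the list, which gives you a rough idea of which formats to handle first and which ones are outliers.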
It appears you probably want to break it down by pattern (given what you've provided). So, for instance, first break out the yearly information:
(.*?)([0-9]{4})(?:, |$)
Then you need to break it down into months:
(.*?)(January|February|...)(?:, |$)
Then you want days contained within that month:
(?:([0-9]{1,2})(?:st|nd|rd|th)(?:, )?)*(?:, |$)
Then it's about compiling the information. But again, that's just going by what you have put in front of me. Ultimately you need to know what kind of data you're working with and how you want to tackle it.
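One thing to watch for with the day pattern: a capture group that sits inside a repeated group only keeps its last repetition, so a single match will not hand you every day in the list. A small sketch of the difference, assuming a day list that has already had its month and year stripped off (the $days string is just an example):

```php
<?php
// Example day list, as it would look once the month and year are removed.
$days = '10th, 11th, 12th';

// One match over a repeated group: group 1 holds only the last repetition.
preg_match('/^(?:([0-9]{1,2})(?:st|nd|rd|th)(?:, )?)*$/', $days, $single);

// Matching one day at a time collects every repetition.
preg_match_all('/([0-9]{1,2})(?:st|nd|rd|th)/', $days, $all);

echo "single match: {$single[1]}\n";
echo "match all: " . implode(', ', $all[1]) . "\n";
```

So when it comes to compiling the days, matching them one at a time is probably the easier route.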
Updated
So, I couldn't help but try to tackle this on my own. I wanted to prove that the method I was describing was somewhat accurate and that I wasn't blowing smoke up your skirt. Having said that, this is what I have come up with. Note that this is in PHP for a couple of reasons:
- PHP was easier to get my hands on
- I felt that if this was a viable solution, you should have to work at porting it over. :grin:
Anyways, here's the source and demo output. Enjoy.
<?php
$samples = array(
    '20th, 21st August 1897',
    '31st May, 1st June 1909',
    '29th January 2007',
    '10th, 11th, 12th May 1954',
    '26th, 27th, 28th, 29th, 30th March 2006',
    '27th, 28th, 29th, 30th November, 1st December 2006',
    '30th, 31st, December 2010, 1st, 2nd January 2011'
);

$months = array('january','february','march','april','may','june','july','august','september','october','november','december');

foreach ($samples as $sample)
{
    $dates = array();

    // Step 1: split the sample into chunks that each end in a 4-digit year.
    $yearly = null;
    if (preg_match_all('/(?:^|\s)(?<month>.*?)\s?(?<year>[0-9]{4})(?:$|,)/', $sample, $yearly))
    {
        for ($y = 0; $y < count($yearly[0]); $y++)
        {
            $year = $yearly['year'][$y];

            // Step 2: within a year chunk, find each month and the days listed before it.
            // The month may be followed by a comma ("31st May, 1st June") or end the chunk,
            // so it is not anchored to the end of the string alone.
            $monthly = null;
            if (preg_match_all('/(?<days>(?:(?:^|\s)[0-9]{1,2}(?:st|nd|rd|th),?)*)\s?(?<month>'.implode('|', $months).')(?:,|$)/i', $yearly['month'][$y], $monthly))
            {
                for ($m = 0; $m < count($monthly[0]); $m++)
                {
                    $month = $monthly['month'][$m];

                    // Step 3: pull the individual day numbers out of the day list.
                    $daily = null;
                    if (preg_match_all('/(?:^|\s)(?<day>[0-9]{1,2})(?:st|nd|rd|th)(?:,|$)/i', $monthly['days'][$m], $daily))
                    {
                        for ($d = 0; $d < count($daily[0]); $d++)
                        {
                            $day = $daily['day'][$d];
                            // Emit month-day-year; the month number is its index in $months plus one.
                            $dates[] = sprintf("%d-%d-%d", array_search(strtolower($month), $months) + 1, $day, $year);
                        }
                    }
                }
            }
        }
    }

    echo "<p><b>{$sample}</b> was parsed to include:</p><ul>\r\n";
    foreach ($dates as $date)
        echo "<li>{$date}</li>\r\n";
    echo "</ul>\r\n";
}
?>
20th, 21st August 1897 was parsed to include:
- 8-20-1897
- 8-21-1897
31st May, 1st June 1909 was parsed to include:
- 5-31-1909
- 6-1-1909
29th January 2007 was parsed to include:
- 1-29-2007
10th, 11th, 12th May 1954 was parsed to include:
- 5-10-1954
- 5-11-1954
- 5-12-1954
26th, 27th, 28th, 29th, 30th March 2006 was parsed to include:
- 3-26-2006
- 3-27-2006
- 3-28-2006
- 3-29-2006
- 3-30-2006
27th, 28th, 29th, 30th November, 1st December 2006 was parsed to include:
- 11-27-2006
- 11-28-2006
- 11-29-2006
- 11-30-2006
- 12-1-2006
30th, 31st, December 2010, 1st, 2nd January 2011 was parsed to include:
- 12-30-2010
- 12-31-2010
- 1-1-2011
- 1-2-2011
And to prove there's nothing up my sleeve, http://www.ideone.com/GGMaH