I'm working on a program that runs a series of regexes to find a date in the DOM of a web page. For example, on www.engadget.com/2010/07/19/windows-phone-7-in-depth-preview/ my regex matches "Jul 19th 2010". This worked fine across multiple formats and languages until I hit an Arabic web page. As an example, consider http://islammaktoob.maktoobblog.com/. The date July 18, 2010 appears in Arabic at the top of the post, but I can't figure out how to match it. Does anyone have experience matching Arabic dates? An example, or the regex you would use to match that Arabic date, would be very helpful. Thank you!
Update:
Getting closer:
String fromTheSite = "كتبها اسلام مكتوب ، في 18 تموز 2010 الساعة: 09:42 ص";
NamedMatcher infoMatcher = NamedPattern.compile("(?<Day>[0-3]?[0-9]) (?<Month>يناير|فبراير|مارس|أبريل|إبريل|مايو|يونيو|يونيه|يوليو|يوليه|أغسطس|سبتمبر|أكتوبر|نوفمبر|ديسمبر|كانون الثاني|شباط|آذار|نيسان|أيار|حزيران|تموز|آب|أيلول|تشرين الأول|تشرين الثاني|كانون الأول) (?<Year>[1-2][0-9][0-9][0-9]) ", Pattern.CANON_EQ).matcher(fromTheSite);
while (infoMatcher.find()) {
    System.out.println(infoMatcher.group());
    System.out.println(infoMatcher.group("Day"));
    System.out.println(infoMatcher.group("Month"));
    System.out.println(infoMatcher.group("Year"));
}
This prints:
18 تموز 2010
18
تموز
2010
Why does the full match (the first line of output) appear out of order, even though the individual groups look right?
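One way to check whether the order is wrong in the matched string itself, or only in how it is displayed, is to dump the match's code points in stored (logical) order. The following is a minimal sketch, not the code above: the class and helper names (ArabicDateOrderCheck, extract, dumpCodePoints) are made up for illustration, it swaps NamedPattern for the named groups built into java.util.regex (Java 7 and later), it needs Java 8 for codePoints(), and the month alternation is cut down to تموز, written as Unicode escapes so the source file does not depend on its encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArabicDateOrderCheck {
    // The month name "Tammuz" (July), written with escapes: U+062A U+0645 U+0648 U+0632.
    static final String TAMMUZ = "\u062A\u0645\u0648\u0632";

    // Assumption: only one month is listed here to keep the sketch short.
    static final Pattern DATE = Pattern.compile(
            "(?<Day>[0-3]?[0-9]) (?<Month>" + TAMMUZ + ") (?<Year>[12][0-9]{3})");

    /** Returns {fullMatch, day, month, year} for the first date found, or null if none. */
    static String[] extract(String text) {
        Matcher m = DATE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group("Day"), m.group("Month"), m.group("Year") };
    }

    /** Lists the code points of s in the order they are stored, ignoring display order. */
    static String dumpCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] parts = extract("18 " + TAMMUZ + " 2010");
        if (parts != null) {
            // The dump contains only ASCII, so bidirectional rendering cannot reorder it.
            System.out.println(dumpCodePoints(parts[0]));
        }
    }
}
```

If the dump lists the day digits first, then the Arabic letters, then the year digits, the match is stored in the expected order, and the reordering is probably happening when the console or browser lays out mixed left-to-right and right-to-left text.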