I need to build a web service that analyzes SEO. The service will show how often a site is updated, so I need to figure out how to get the posted date, or the update frequency, from the website's HTML.
For example, on http://googletesting.blogspot.com/ I can get the date from the tag <span>Wednesday, June 04, 2014</span>. Other websites don't use the same tags or date format, so I can't use the same code to detect their dates.
Dates can have very different formats in different locales, and month names can be written as text or as numbers. I need to match as many dates as possible. Sometimes a date found in the page isn't a posted date at all; it's just text inside an article.
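A minimal sketch of the pattern-matching idea, to make the problem concrete. It assumes English month names and only three date layouts (ISO, "June 04, 2014", and "30 May 2012" with or without spaces); the names `find_dates`, `MONTHS`, and `PATTERNS` are hypothetical, and it cannot tell a posted date from a date that merely appears in article text.

```python
import re
from datetime import date

# English month names and their three-letter abbreviations (an assumption:
# other locales would need their own table).
_NAMES = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
MONTHS = {}
for number, name in enumerate(_NAMES, start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number

# Each pattern exposes the named groups y, m, d; m may be digits or a name.
PATTERNS = [
    # 2014-06-04
    re.compile(r"\b(?P<y>\d{4})-(?P<m>\d{1,2})-(?P<d>\d{1,2})\b"),
    # June 04, 2014
    re.compile(r"\b(?P<m>[A-Za-z]{3,9})\.?\s+(?P<d>\d{1,2}),?\s+(?P<y>\d{4})\b"),
    # 30 May 2012, 30May 2012, 30May2012
    re.compile(r"\b(?P<d>\d{1,2})\s*(?P<m>[A-Za-z]{3,9})\.?\s*(?P<y>\d{4})\b"),
]


def find_dates(text):
    """Return the valid dates found in text, ordered by position."""
    hits = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            raw_month = match.group("m").lower()
            if raw_month.isdigit():
                month = int(raw_month)
            else:
                month = MONTHS.get(raw_month)
            if month is None:
                continue  # the word before the number is not a month name
            try:
                found = date(int(match.group("y")), month, int(match.group("d")))
            except ValueError:
                continue  # impossible date such as Feb 30
            hits.append((match.start(), found))
    return [found for _, found in sorted(hits)]
```

With this sketch, `find_dates("<span>Wednesday, June 04, 2014</span>")` returns `[datetime.date(2014, 6, 4)]`. Every additional format or language needs another pattern, which is exactly why this approach does not scale to arbitrary websites.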
My algorithm: I try to get the posted date of every post and then calculate the update frequency from those dates. For example, if the first post is dated 30 May 2012, the second 29 May 2012, and the third 28 May 2012, the result is that the website is updated daily.
In the end, I want to know if each website updates:
- Yearly
- Monthly
- Weekly
- Daily
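The frequency step could be sketched roughly as follows, assuming the posted dates have already been extracted. It takes the median gap in days between consecutive posts and maps it to one of the four labels; the function name `update_frequency` and the day thresholds (2, 10, 45) are arbitrary assumptions, not established cutoffs.

```python
from datetime import date
from statistics import median


def update_frequency(dates):
    """Classify a list of posted dates as Daily, Weekly, Monthly or Yearly.

    Returns None when fewer than two distinct dates are available.
    """
    unique = sorted(set(dates))
    if len(unique) < 2:
        return None
    # Gap in days between each pair of consecutive posts.
    gaps = [(later - earlier).days for earlier, later in zip(unique, unique[1:])]
    typical = median(gaps)
    # Thresholds below are guesses chosen for illustration only.
    if typical <= 2:
        return "Daily"
    if typical <= 10:
        return "Weekly"
    if typical <= 45:
        return "Monthly"
    return "Yearly"


posts = [date(2012, 5, 30), date(2012, 5, 29), date(2012, 5, 28)]
print(update_frequency(posts))  # Daily
```

The median is used instead of the mean so that one long pause between posts does not change the label.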
How do I reliably get this from any website?