I need to build a web service that analyzes SEO. The service will show how often a site is updated. I need to figure out how to get the posted date or update frequency from the site's HTML.

For example, on http://googletesting.blogspot.com/ I can get the date from the tag <span>Wednesday, June 04, 2014</span>. Other websites don't use the same tags or date format, so I can't use the same code to detect those dates. Dates can have very different formats in different locales, and month names can be written as text or as numbers. I need to match as many dates as possible. Sometimes a date on the page isn't a posted date at all; it's just text inside an article.

My algorithm: I attempt to get the posted date of every post, then calculate the update frequency. For example, if the first post is dated 30 May 2012, the second 29 May 2012, and the third 28 May 2012, the result is that this website is updated daily.

In the end, I want to know if each website updates:

  • Yearly
  • Monthly
  • Weekly
  • Daily
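
The calculation described above can be sketched roughly as follows, assuming the posted dates have already been extracted. The class name `UpdateFrequency`, the use of the median gap, and the day thresholds are all illustrative assumptions, not an established rule:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: the name and the thresholds are assumptions for illustration.
final class UpdateFrequency {

    // Classify a site by the median gap (in days) between consecutive post dates.
    static String classify(List<LocalDate> postDates) {
        if (postDates.size() < 2) {
            return "Unknown"; // a single date gives no interval to measure
        }
        List<LocalDate> sorted = new ArrayList<>(postDates);
        Collections.sort(sorted);

        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            gaps.add(ChronoUnit.DAYS.between(sorted.get(i - 1), sorted.get(i)));
        }
        Collections.sort(gaps);
        long median = gaps.get(gaps.size() / 2); // upper median for even counts

        // Assumed cut-offs; they would need tuning against real sites.
        if (median <= 2) return "Daily";
        if (median <= 10) return "Weekly";
        if (median <= 45) return "Monthly";
        return "Yearly";
    }

    public static void main(String[] args) {
        List<LocalDate> dates = List.of(
                LocalDate.of(2012, 5, 30),
                LocalDate.of(2012, 5, 29),
                LocalDate.of(2012, 5, 28));
        System.out.println(classify(dates)); // prints "Daily"
    }
}
```

The median is used here instead of the mean so that one long pause between posts does not dominate the result.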

How do I reliably get this from any website?

Fame th
  • If you're looking at blogs only, the RSS feeds will have what you're after – hd1 Jun 10 '14 at 06:55
  • check this link, it worked for me! https://stackoverflow.com/questions/24134670/how-to-get-the-update-frequency-of-websites – Mostafa Wael May 19 '21 at 00:42

1 Answer


Instead of parsing the dates in the page, you could download the home page and store it. Then you could come back every day and download the homepage again to see if it changed. This approach would work even for sites that don't publish any dates on their homepage. It would take longer to get your answer though.
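
A minimal sketch of that snapshot comparison, assuming Java 11+ for `java.net.http`. The class `HomepageSnapshot` and its method names are hypothetical, and where the previous hash gets stored between runs is left to you:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: names are illustrative, not from any library.
final class HomepageSnapshot {

    // Hex-encoded SHA-256 of the page content; storing this is cheaper than storing the page.
    static String sha256(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    // Download the page body as text, following ordinary redirects.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // True when today's download differs from the hash saved on the previous visit.
    static boolean hasChanged(String previousHash, String currentHtml) {
        return !sha256(currentHtml).equals(previousHash);
    }
}
```

Comparing raw HTML can report a change when only an ad, a timestamp, or a session token differs, so in practice you would probably hash just the main content area of the page.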


Another approach would be to download the RSS feed for the site if it has one. The example site you give has an XML feed: http://feeds.feedburner.com/blogspot/RLXA?format=xml RSS feeds are meant to be machine readable, and the dates are in a consistent format.
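
As a rough sketch of reading dates from a feed, the code below assumes RSS 2.0 `pubDate` elements (RFC 1123 style dates) or Atom `published` elements (ISO 8601 dates). The class name `FeedDates` is hypothetical, and the feed XML is taken as a string so that the download step stays separate:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects publication dates from an RSS or Atom document.
final class FeedDates {

    static List<OffsetDateTime> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds are untrusted input, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            List<OffsetDateTime> dates = new ArrayList<>();

            // RSS 2.0: <pubDate>Wed, 04 Jun 2014 10:00:00 GMT</pubDate>
            NodeList rss = doc.getElementsByTagName("pubDate");
            for (int i = 0; i < rss.getLength(); i++) {
                String text = rss.item(i).getTextContent().trim();
                dates.add(OffsetDateTime.parse(text, DateTimeFormatter.RFC_1123_DATE_TIME));
            }

            // Atom: <published>2014-06-04T10:00:00-07:00</published>
            NodeList atom = doc.getElementsByTagName("published");
            for (int i = 0; i < atom.getLength(); i++) {
                String text = atom.item(i).getTextContent().trim();
                dates.add(OffsetDateTime.parse(text)); // ISO-8601 with offset
            }
            return dates;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read feed", e);
        }
    }
}
```

Some feeds also carry a channel-level `pubDate`, which this sketch would count as if it were a post, and feeds with sloppy date strings would need a more lenient parser.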


You also say that you are using Java. I've found that Java's date parsing libraries are not very flexible. They force you to know the exact format of the date before you parse it. I have written a free, open source flexible date time parser in Java that you could try: http://ostermiller.org/utils/DateTimeParse.html Once you have found dates on the page (maybe by looking at what comes after "posted on"), you could use my flexible parser to parse dates in a variety of formats.
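
To illustrate the exact-format limitation, here is a sketch that uses only the standard `java.time` classes and tries a fixed list of patterns in turn. It does not show the API of the parser linked above; the class name `MultiFormatDateParser` and the pattern list are assumptions, and the list covers English month names only:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: tries each known pattern until one fits.
final class MultiFormatDateParser {

    private static final List<DateTimeFormatter> FORMATS = List.of(
            build("EEEE, MMMM dd, yyyy"), // Wednesday, June 04, 2014
            build("MMMM d, yyyy"),        // June 4, 2014
            build("d MMMM yyyy"),         // 30 May 2012
            build("d MMM yyyy"),          // 4 Jun 2014
            build("yyyy-MM-dd"),          // 2012-05-30
            build("MM/dd/yyyy"));         // 05/30/2012, assuming US month-first order

    private static DateTimeFormatter build(String pattern) {
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .appendPattern(pattern)
                .toFormatter(Locale.ENGLISH);
    }

    // Returns the first successful parse, or empty if no pattern matches.
    static Optional<LocalDate> parse(String text) {
        String candidate = text.trim();
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDate.parse(candidate, format));
            } catch (DateTimeParseException e) {
                // Not this format; fall through to the next one.
            }
        }
        return Optional.empty();
    }
}
```

Every new format or locale means another entry in the list, and ambiguous numeric dates such as 05/06/2012 are resolved by whichever pattern comes first.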

Stephen Ostermiller