
I'm currently trying to work out a concept for solving the following problem:

In Java, I'm scraping a web page that lists articles. If any of these articles becomes available or changes in some way, I want to get an alert. Scraping the list of all articles is pretty fast. But I also want to check each article's own page, for example to see whether a certain size of an article has become available. Since this means opening another page for every article, it will take much longer. (That is a separate problem which I still need to solve.)

My current question is: how can I find out whether something has changed? Would it be a good approach to save every article in a database, then on the next check save all the articles in another database and compare the two? Or to store every article in a bean, collect the beans in an array, and compare the two arrays to see whether they differ?

I currently don't see a light at the end of the tunnel for solving this in a clean and elegant way.

Any comment would be appreciated. Thanks in advance.

pythoniosIV
  • 2
    You should look into using a hash such as md5 – Drew Galbraith Jul 23 '14 at 19:57
  • 1
    If these are indeed articles, don't they have a "last updated" timestamp in their subtitle? If so, you're better off comparing timestamps, than scraping and saving entire contents or hashes thereof. – Traveling Tech Guy Jul 23 '14 at 19:59
  • @DrewGalbraith never thought of that, great! But still I would need somehow to persist the data and then compare it on the next "scrap" – pythoniosIV Jul 23 '14 at 20:01
  • @TravelingTechGuy Well they are just with the -Tag in the html and don't really have a last updated timestamp. – pythoniosIV Jul 23 '14 at 20:01
  • 1
    Like you said, a database would be the best for that. All you need are entries with the URL of an article, any other information you want to keep with it, and its most recent hash. The next time you scrape it, you can compare the new hash to the one in the database, and if they're different then you know it's been updated. – Drew Galbraith Jul 23 '14 at 20:02
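
The hash-and-compare idea from the comments could be sketched roughly as follows. This is only a minimal illustration under a few assumptions: the class `ArticleChangeDetector`, its `check` method, and the `Status` enum are hypothetical names, an in-memory `HashMap` stands in for the database table keyed by article URL, and SHA-256 is used in place of the MD5 suggested above (either works with `MessageDigest`). The URL and article text in `main` are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original question.
class ArticleChangeDetector {

    enum Status { NEW, CHANGED, UNCHANGED }

    // Stand-in for a database table: article URL -> hash seen on the last scrape.
    private final Map<String, String> lastHashByUrl = new HashMap<>();

    // Returns the hex-encoded SHA-256 digest of the scraped article text.
    static String hash(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Compares the fresh hash with the stored one, then stores the fresh hash.
    Status check(String url, String content) {
        String newHash = hash(content);
        String oldHash = lastHashByUrl.put(url, newHash);
        if (oldHash == null) {
            return Status.NEW;
        }
        return oldHash.equals(newHash) ? Status.UNCHANGED : Status.CHANGED;
    }

    public static void main(String[] args) {
        ArticleChangeDetector detector = new ArticleChangeDetector();
        String url = "http://example.com/article/1"; // placeholder URL
        System.out.println(detector.check(url, "Shoe, sizes: 42"));     // NEW
        System.out.println(detector.check(url, "Shoe, sizes: 42"));     // UNCHANGED
        System.out.println(detector.check(url, "Shoe, sizes: 42, 43")); // CHANGED
    }
}
```

With this shape only one hash per article needs to be persisted, so there is no need for a second database or for comparing two full arrays of beans. It probably makes sense to hash only the extracted article fields rather than the whole page, since unrelated markup changes would otherwise trigger alerts.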

0 Answers