I am trying to set up a bash script to download a web page once a day, then run a diff of the last two pages and send an alert if the pages are more than 15% different. I'm not really sure how to approach the selection of the two most recent pages.

The script starts simple enough, just doing a wget of a page and inserting the date into the filename:

wget --output-document=index`date +%Y-%m-%d`.html https://www.example.com

Assuming a couple of those pages have been collected, we run a diff of the two most recent pages. (And this is where I'm lost)

sdiff -B -b -s index1.html index2.html | wc -l

Any suggestions on how to set this up so it can pull the last two files and run the diff?

Ikarian
  • Save the timestamp of the last run in a file or with a link? – Etan Reisner Nov 11 '15 at 17:32
  • There are many ways. I might opt to save the file with the first characters as `YYYYMMDD`. Then you'll be able to order them easily in chronological order with `ls`. From there, it should be easy to pull the two most recently downloaded files, no? – Marc Nov 11 '15 at 17:34
  • Marc - You are correct, but I'm trying to find out if there is a more code-friendly method, like an iterative function that can look at the file modified date, or something like that. "sdiff index%age+0.html index%age+1.html" Something that doesn't require the explicit filename, obviously, but will read the filenames and pull the most recent (or I suppose, the largest sum of D+M+Y would work too, right?) – Ikarian Nov 11 '15 at 19:48
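
The idea from the comments (let the date in the filename do the ordering) could be sketched roughly as below. This is only an illustration: `latest_two` is a made-up helper name, and it assumes the snapshots are named as in the wget line above (index plus a zero-padded year-month-day date), so that sorting by name is the same as sorting by date.

```shell
#!/usr/bin/env bash
# latest_two is a hypothetical helper, not part of any tool.
# It prints the two most recent snapshots in a directory, newest first.
# Assumption: names look like index2015-11-11.html, so name order == date order.
latest_two() {
    local dir=${1:-.}
    local files=("$dir"/index*.html)   # the glob expands in sorted order
    local n=${#files[@]}
    # An unmatched glob stays literal, so also check that the first entry exists.
    if [ ! -e "${files[0]}" ] || [ "$n" -lt 2 ]; then
        echo "need at least two snapshots in $dir" >&2
        return 1
    fi
    printf '%s\n' "${files[n-1]}" "${files[n-2]}"
}

# Possible use (previous page first, newest page second):
#   mapfile -t pair < <(latest_two .)
#   sdiff -B -b -s "${pair[1]}" "${pair[0]}" | wc -l
```

Sorting by name rather than by modification time means the result does not change if a file is touched or copied later, but it only works as long as the date format in the name stays zero-padded and year-first.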

1 Answer

I would keep the date as part of the file name when you run wget.

For the file comparison, I would use something like the following:

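# -d "1 day ago" is GNU date syntax for yesterday's date
YdayFile="index$(date -d "1 day ago" +%Y%m%d).html"
TodaysFile="index$(date +%Y%m%d).html"
wget --output-document="$TodaysFile" https://www.example.com
# Count the lines that differ between today's and yesterday's pages
sdiff -B -b -s "$TodaysFile" "$YdayFile" | wc -l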

You could replace "1 day ago" with any number of days you want to go back. Checking that both files exist before running the diff would be a good idea too.
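
As a rough sketch of that existence check, combined with the 15% threshold from the question: the function name `percent_changed` is invented for this example, and "percent different" is assumed here to mean the number of lines reported by sdiff -s divided by the line count of the longer file. Other definitions are just as reasonable.

```shell
#!/usr/bin/env bash
# percent_changed is a hypothetical helper, not part of any tool.
# It prints the percentage of differing lines as an integer, rounded down.
# Assumption: percent = lines reported by sdiff -s / lines in the longer file.
percent_changed() {
    local old=$1 new=$2
    # Existence check: skip the diff if either snapshot is missing.
    if [ ! -f "$old" ] || [ ! -f "$new" ]; then
        echo "missing snapshot: $old or $new" >&2
        return 2
    fi
    local changed old_lines new_lines total
    changed=$(sdiff -B -b -s "$old" "$new" | wc -l)
    old_lines=$(wc -l < "$old")
    new_lines=$(wc -l < "$new")
    total=$(( old_lines > new_lines ? old_lines : new_lines ))
    # Two empty files: nothing to compare, report 0 instead of dividing by zero.
    if [ "$total" -eq 0 ]; then
        echo 0
        return 0
    fi
    echo $(( changed * 100 / total ))
}

# Possible use with the variables from the script above:
#   pct=$(percent_changed "$YdayFile" "$TodaysFile") &&
#       [ "$pct" -gt 15 ] && echo "page changed by ${pct}%"
```

The echo in the usage comment is only a placeholder for whatever alert you actually want to send.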

Check out this link for more date operations: http://ss64.com/

cyber.sh