1

I googled and couldn't find any code that would compare a webpage to a previous version.

In this case the page I'm trying to watch is link text. There are services that can watch a page, but I'd like to set this up on my own server.

I've set this up as a wiki so anyone can add to the code. Here's my idea:

  1. Check whether a previous version of the file exists. If not, download the page.
  2. If a previous version exists, download the current page, diff the two to find the differences, and email the new content along with the dates of the new and old versions.

This script would be called nightly via cron, or on demand via the browser (the latter is not a priority).

Sounds simple, maybe I'm just not looking in the right place.

shaiss
  • There are a couple of things that may point you in the right direction: http://www.diffbot.com and a thick-client application, http://www.changedetect.com. The latter does allow you to generate emails of differences. Not sure if this is the full solution, but – Keith Adler Sep 29 '09 at 19:40
  • I signed up for both of those services; we'll see how they work out. But again, it would really be nicer to have a simple script one can put on a web server and schedule via cron. – shaiss Sep 30 '09 at 16:46

2 Answers

3

Perhaps a simple sh script like this, using wget, diff and test?

#!/bin/sh

WWWURI="http://foo.bar/testfile.html"
LOCALCOPY="testfile.html"
TMPFILE="tmpfile"
WEBFILE="changed.html"

MAILADDRESS="$(whoami)"
SUBJECT_NEWFILE="$LOCALCOPY is new"
BODY_NEWFILE="first version of $LOCALCOPY loaded"
SUBJECT_CHANGEDFILE="$LOCALCOPY updated"
SUBJECT_NOTCHANGED="$LOCALCOPY not updated"
BODY_CHANGEDFILE="new version of $LOCALCOPY"

# test for old file
if [ -e "$LOCALCOPY" ]
then
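    mv "$LOCALCOPY" "$LOCALCOPY.bak"
    # if the download fails, restore the old copy so the next run still has a baseline
    if ! wget "$WWWURI" -O"$LOCALCOPY" -o/dev/null
    then
        mv "$LOCALCOPY.bak" "$LOCALCOPY"
        exit 1
    fi
    # old file first, so "<" lines are removed content and ">" lines are new content
    diff "$LOCALCOPY.bak" "$LOCALCOPY" > "$TMPFILE"

    # test for update
    if [ -s "$TMPFILE" ]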
    then
        echo "$SUBJECT_CHANGEDFILE"
        ( echo "$BODY_CHANGEDFILE" ; cat "$TMPFILE" ) | tee "$WEBFILE" | mail -s "$SUBJECT_CHANGEDFILE" "$MAILADDRESS"
    else
        echo "$SUBJECT_NOTCHANGED"
    fi
else
    wget "$WWWURI" -O"$LOCALCOPY" -o/dev/null
    echo "$BODY_NEWFILE"
    echo "$BODY_NEWFILE" | tee "$WEBFILE" | mail -s "$SUBJECT_NEWFILE" "$MAILADDRESS"
fi
[ -e "$TMPFILE" ] && rm "$TMPFILE"

Update: piped the output through tee, fixed some spelling, and added removal of $TMPFILE.
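The question also asks for the dates of the old and new versions in the email. Here is a minimal sketch of one way to get them, assuming GNU date (for `date -r FILE`) and a diff that supports `-u` and `--label`. `dated_diff` is a hypothetical helper name, not part of the script above. It uses each file's modification time as its date; whether that reflects the download time or the server's timestamp depends on how wget is configured.

```shell
#!/bin/sh
# Sketch: a unified diff whose header lines carry each file's date.
# Assumptions: GNU date (-r FILE prints the file's mtime) and
# a diff that understands -u and --label.

# dated_diff OLD NEW
# Prints a unified diff of OLD against NEW; exit status is diff's own
# (0 = identical, 1 = different, 2 = trouble).
dated_diff() {
    old="$1"
    new="$2"
    old_date="$(date -r "$old" '+%Y-%m-%d %H:%M')"
    new_date="$(date -r "$new" '+%Y-%m-%d %H:%M')"
    diff -u \
        --label "old version ($old_date)" \
        --label "new version ($new_date)" \
        "$old" "$new"
}
```

In the script above, `dated_diff "$LOCALCOPY.bak" "$LOCALCOPY" > "$TMPFILE"` could probably stand in for the plain diff call. For the nightly run, a crontab line such as `0 2 * * * /path/to/watchpage.sh` would do, where the path and script name are placeholders.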

osti
  • Great script. I've set it up on my web server and will post back shortly with the results. – shaiss Sep 30 '09 at 16:52
  • The script works like a charm; however, I still believe the ideal solution would be a web language that provides access via the browser. – shaiss Sep 30 '09 at 18:58
  • The tee pipe writes the diff to a file and then passes it on to mail. For a more sophisticated version, you probably want to switch to PHP or something similar :) – osti Oct 03 '09 at 11:25
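To illustrate the tee step from the last comment: tee copies its input to a file and also passes it through on stdout, so the same text can be saved for the browser and handed to the next command. A small sketch, where `save_and_forward` is a made-up name and the next command would be `mail` in the script:

```shell
#!/bin/sh
# Sketch of the tee step: keep a copy on disk, pass the text along unchanged.

# save_and_forward FILE
# Reads stdin, writes it to FILE, and echoes it to stdout for the next command.
save_and_forward() {
    tee "$1"
}
```

For example, `echo "report" | save_and_forward changed.html | wc -c` leaves the text in changed.html and still lets `wc` count it.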
0

You can check this SO posting for a few ideas, and for information about the challenge of detecting "true" changes to a web page (with fluctuating advertisement blocks and other "noise").
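As a rough illustration of filtering noise before comparing, here is a sketch that strips `<script>` elements from both versions and diffs what is left. It is a crude, line-based heuristic, not a real HTML parser; treating script blocks as the noise is an assumption, and `strip_noise` and `diff_ignoring_noise` are hypothetical names. It also assumes `mktemp` is available.

```shell
#!/bin/sh
# Sketch: ignore <script> elements when comparing two saved pages.
# Line-based and approximate; real pages may need other patterns.

# strip_noise FILE
# Prints FILE without single-line <script>...</script> elements
# and without multi-line <script> ... </script> blocks.
strip_noise() {
    sed -e 's/<script[^>]*>.*<\/script>//g' \
        -e '/<script/,/<\/script>/d' "$1"
}

# diff_ignoring_noise OLD NEW
# Diffs the stripped versions; returns diff's exit status
# (0 = no meaningful change, 1 = changed).
diff_ignoring_noise() {
    old_clean="$(mktemp)"
    new_clean="$(mktemp)"
    strip_noise "$1" > "$old_clean"
    strip_noise "$2" > "$new_clean"
    diff "$old_clean" "$new_clean"
    status=$?
    rm -f "$old_clean" "$new_clean"
    return "$status"
}
```

For a page with only minor weekly edits and no rotating content, this filtering is probably unnecessary and a plain diff is enough.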

Community
mjv
  • Valid points; however, I'm not looking to fingerprint the page. In this case it's one site with minor changes happening weekly, so even if a change is minor it would still be nice to see it. – shaiss Sep 30 '09 at 17:00