I'm looking to get reliable diffs of only the content of this page (structural changes will be rare and can therefore be ignored). More specifically, the only change I need to pick up is a new Instruction ID being added.
To get a feel for what difflib will produce, I first diff two identical HTML contents, hoping to get nothing back:
import difflib
import urllib

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib.urlopen(url)  # Python 2 API
content = response.read()       # the whole page as a single string

d = difflib.Differ()
diffed = d.compare(content, content)
Since difflib mimics the UNIX diff utility, I would expect diffed to contain nothing (or give some indication that the sequences were identical). Yet if I '\n'.join diffed, I get something resembling HTML (although it doesn't render in a browser).
Indeed, if I take the simplest case possible of diffing two characters:
diffed = d.compare('a', 'a')
diffed.next()
produces the following:
'  a'
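To make the behaviour concrete, here is a minimal sketch written for Python 3, using made-up sample lines rather than the real page. It runs the same identical-input comparison through Differ.compare and through difflib.unified_diff. As far as I can tell, Differ yields every line, marking unchanged ones with a two-space prefix, whereas unified_diff yields nothing when there are no differences.

```python
import difflib

# Hypothetical sample content standing in for the downloaded HTML.
content = "<html>\n<body>\n<p>Instruction ID 123</p>\n</body>\n</html>\n"

# Split into a list of lines (keeping the line endings).
lines = content.splitlines(keepends=True)

# Differ.compare yields one output line per input line, changed or not.
differ_output = list(difflib.Differ().compare(lines, lines))

# unified_diff only yields output when the two inputs differ.
unified_output = list(difflib.unified_diff(lines, lines))

print(len(differ_output))   # one entry per input line
print(unified_output)       # empty list for identical input
```

So the "something resembling HTML" would just be the original content echoed back with a two-character prefix on each item.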
So either I am expecting something from difflib that it can't or won't provide (and I should change tack), or I am misusing it. Which is it, and what are viable alternatives for diffing HTML?
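For context, what I ultimately want is something along the lines of the sketch below (Python 3). The helper name added_lines and the sample rows are hypothetical, and I am assuming that a new Instruction ID shows up as a whole new line in the page source. It keeps only the Differ output lines marked with '+ ', i.e. lines present in the new version but not the old one.

```python
import difflib


def added_lines(old_text, new_text):
    """Return lines that appear in new_text but not in old_text.

    Hypothetical helper: relies on Differ prefixing added lines with '+ '.
    """
    old = old_text.splitlines()
    new = new_text.splitlines()
    return [
        line[2:]  # strip the two-character Differ prefix
        for line in difflib.Differ().compare(old, new)
        if line.startswith("+ ")
    ]


# Made-up stand-ins for two snapshots of the page.
old_page = "<tr><td>ID 1</td></tr>\n<tr><td>ID 2</td></tr>\n"
new_page = old_page + "<tr><td>ID 3</td></tr>\n"

print(added_lines(old_page, new_page))
print(added_lines(old_page, old_page))
```

If difflib is the wrong tool for this, I am happy to parse the HTML and compare the extracted IDs instead.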