Comparing HTML with difflib

Question

I'm looking to get reliable diffs of the content only of this page (structural changes will be rare and can therefore be ignored). More specifically, the only change I need to pick up is the addition of a new Instruction ID.

To get a feel for what difflib will produce, I first diff the page's HTML against itself, hoping to get nothing back:

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib.urlopen(url)
content = response.read()
import difflib
d = difflib.Differ()

diffed = d.compare(content, content)

Since difflib mimics the UNIX diff utility, I would expect diffed to contain nothing (or give some indication that the sequences were identical). Yet if I '\n'.join diffed, I get something resembling HTML (although it doesn't render in a browser).

Indeed, if I take the simplest case possible of diffing two characters:

diffed = d.compare('a', 'a')

diffed.next() produces the following:

'  a'

So either I am expecting something from difflib that it can't or won't provide (and I should change tack), or I am misusing it. Which is it, and what are viable alternatives for diffing HTML?

score 4 · Accepted Answer · edited Oct 28 '20 at 11:42

The arguments to Differ.compare() are supposed to be sequences of strings. If you pass two plain strings, each one is treated as a sequence of characters, so they are compared character by character.

So your example should be rewritten as:

import difflib
import urllib  # Python 2; in Python 3 use urllib.request.urlopen and decode the bytes

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib.urlopen(url)
content = response.readlines()  # get response as list of lines

d = difflib.Differ()
diffed = d.compare(content, content)
print('\n'.join(diffed))
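
To see the difference between passing strings and passing lists of lines without fetching the page, here is a minimal sketch. The `<td>` rows and the IDs in them (`GN 00101` and so on) are made-up sample data, not taken from the real page.

```python
import difflib

d = difflib.Differ()

# Two plain strings: every character is one item of the sequence.
by_char = list(d.compare('ab', 'ab'))
print(by_char)  # ['  a', '  b']

# Two lists of lines: every line is one item of the sequence.
# The rows below are hypothetical sample data.
old_lines = ['<td>GN 00101</td>\n', '<td>GN 00102</td>\n']
new_lines = old_lines + ['<td>GN 00103</td>\n']
by_line = list(d.compare(old_lines, new_lines))

# Each output line already ends with a newline here, so join on ''.
print(''.join(by_line))
```

Unchanged items come back prefixed with two spaces, and the appended row comes back prefixed with `+ `.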

If you only want to compare the content of an HTML file, you should probably use a parser to process it and get only the text without tags, e.g. by using BeautifulSoup's soup.stripped_strings:

import bs4
soup = bs4.BeautifulSoup(html_content, 'html.parser')  # name the parser explicitly
diff = d.compare(list(soup.stripped_strings), list_to_compare_to)
print('\n'.join(diff))

edited Oct 28 '20 at 11:42

M. Ka

answered Feb 11 '16 at 21:39

mata

Odd: (i) when I perform `d.compare(content, content)`, where `content` is now the output of `.readlines()` as opposed to `.read()`, the output is still the full HTML document, despite identical content, albeit with each line separated by a newline (ii) Likewise with your suggested bs4 approach - when I compare `list(soup.stripped_strings)` with `list(soup.stripped_strings)`, the output is still the full HTML doc (with tags removed). What am I misunderstanding here? – Pyderman Feb 12 '16 at 23:37
That is how a [Differ](https://docs.python.org/3/library/difflib.html#difflib.Differ) works: it returns the whole document. If there are changes, they are prefixed with `-` or `+` to indicate lines removed from input a or added in input b. So if both inputs are equal, you simply won't get any lines with such prefixes. – mata Feb 13 '16 at 09:25
I see. Thanks. I've put this knowledge to use here. I welcome your thoughts on it: http://stackoverflow.com/q/35375004/1389110 – Pyderman Feb 13 '16 at 14:04
