UNIX tool to dump a selection of HTML?

Question

I'm looking to monitor changes on websites and my current approach is being defeated by a rotating top banner. Is there a UNIX tool that takes a selection parameter (id attribute or XPath), reads HTML from stdin and prints to stdout the subtree based on the selection?

For example, given an HTML document, I want to filter out everything but the subtree of the element with id="content". Basically, I'm looking for the simplest HTML/XML equivalent of grep.

score 2 · Accepted Answer · answered May 05 '10 at 23:12

Possibly not what you're looking for, but how about writing a quick script in Python that uses BeautifulSoup to process the HTML? It gives you a sensibly structured object through which you can access the content.

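# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup as BS
# Fetch the page and parse it into a navigable tree
soup = BS(urllib2.urlopen('http://www.google.com').read())
# Select the second <a> element in the document
soup.findAll('a')[1]
# returns:
# <a onclick="gbar.qs(this)" href="http://video.google.co.uk/?hl=en&amp;tab=wv" class="gb1">Videos</a>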
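
A minimal sketch of the stdin-to-stdout filter the question describes, selecting by id attribute only. To stay runnable without installing anything, it swaps BeautifulSoup for the standard library's html.parser; the script name htmlsel.py, the class SubtreeById and the function extract are hypothetical names chosen for illustration. It tolerates unclosed tags such as <p> by popping a tag stack, but it is not a full tag-soup parser, so BeautifulSoup may still cope better with badly broken pages.

```python
import sys
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SubtreeById(HTMLParser):
    """Collect the source text of the first element whose id matches."""

    def __init__(self, target_id):
        # Keep entities such as &amp; unconverted so they can be echoed back.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.stack = []    # open tags inside the matched element
        self.done = False  # set once the matched element has been closed
        self.parts = []    # pieces of output, joined at the end

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if not self.stack and dict(attrs).get("id") != self.target_id:
            return  # still outside the element we are looking for
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)
        elif not self.stack:
            self.done = True  # the match itself was a void element

    def handle_startendtag(self, tag, attrs):
        if self.done:
            return
        if self.stack:
            self.parts.append(self.get_starttag_text())
        elif dict(attrs).get("id") == self.target_id:
            self.parts.append(self.get_starttag_text())
            self.done = True

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # outside the subtree, or a stray closing tag
        self.parts.append("</%s>" % tag)
        # Pop unclosed children (e.g. <p>, <li>) along with the tag itself.
        while self.stack.pop() != tag:
            pass
        if not self.stack:
            self.done = True

    def handle_data(self, data):
        if self.stack:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.stack:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.stack:
            self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        if self.stack:
            self.parts.append("<!--%s-->" % data)


def extract(html, target_id):
    """Return the subtree of the element with the given id, or ''."""
    parser = SubtreeById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Runs only when exactly one argument (the id) is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(extract(sys.stdin.read(), sys.argv[1]))
```

Assuming the file is saved as htmlsel.py, something like `curl -s http://example.com/ | python3 htmlsel.py content` would print only the element with id="content", which can then be diffed between runs without the rotating banner getting in the way.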

Tom O'Connor

I was thinking about using Beautiful Soup but figured someone had recognized the problem and made a reusable component out of it already. Thanks for the example though. – jldugger May 05 '10 at 23:16
I bet nobody has. If you can be arsed to make it generic enough, I'd suggest you bung the project on googlecode or similar. – Tom O'Connor May 05 '10 at 23:54
This is the strategy I went with. There are a lot of XML tools, but they don't cope with poorly written HTML. I've generalized the code a bit, but need to spend a bit more time UNIX-styling it before publishing. – jldugger May 06 '10 at 21:04
:) Cool. Glad it was useful! – Tom O'Connor May 06 '10 at 22:59

score 1 · Answer 2 · answered May 05 '10 at 23:08

Write a Perl script with LWP and HTML::TreeBuilder::XPath, perhaps.

xenoterracide

Effectively the same as my suggestion. I find BeautifulSoup to be somewhat more semantic than Perl's offering. Depends what the OP prefers I guess! – Tom O'Connor May 05 '10 at 23:12