1

I'm looking to monitor changes on websites and my current approach is being defeated by a rotating top banner. Is there a UNIX tool that takes a selection parameter (id attribute or XPath), reads HTML from stdin and prints to stdout the subtree based on the selection?

For example, given an HTML document, I want to filter out everything but the subtree of the element with id="content". Basically, I'm looking for the simplest HTML/XML equivalent of grep.

jldugger

2 Answers

2

Possibly not what you're looking for, but how about writing a quick script in Python that uses BeautifulSoup to process the HTML? It gives you a sensibly structured object through which you can access the content.

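from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse it; the parser tolerates malformed HTML
soup = BeautifulSoup(urlopen('http://www.google.com').read(), 'html.parser')
# Print the second <a> element in the document
print(soup.find_all('a')[1])
# returns something like:
# <a onclick="gbar.qs(this)" href="http://video.google.co.uk/?hl=en&amp;tab=wv" class="gb1">Videos</a>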
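
Applied to the question itself (read HTML on stdin, print one element's subtree on stdout), the same idea might look roughly like the sketch below. It assumes the `bs4` package is installed; the script name `htmlgrep.py` and the helper `extract_subtree` are made up for illustration and are not an existing tool.

```python
import sys

from bs4 import BeautifulSoup


def extract_subtree(html, element_id):
    """Return the markup of the element with the given id, or None if absent.

    Hypothetical helper for illustration; not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(id=element_id)  # first element whose id attribute matches
    return None if node is None else str(node)


# Only act as a filter when exactly one argument (the id) is supplied,
# e.g.:  curl -s http://example.com/ | python htmlgrep.py content
if __name__ == "__main__" and len(sys.argv) == 2:
    subtree = extract_subtree(sys.stdin.read(), sys.argv[1])
    if subtree is not None:
        print(subtree)
```

Piping two snapshots of a page through this and diffing the results would ignore anything outside the selected element, such as a rotating banner.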
Tom O'Connor
  • I was thinking about using Beautiful Soup but figured someone had recognized the problem and made a reusable component out of it already. Thanks for the example though. – jldugger May 05 '10 at 23:16
  • I bet nobody has. If you can be arsed to make it generic enough, I'd suggest you bung the project on Google Code or similar. – Tom O'Connor May 05 '10 at 23:54
  • This is the strategy I went with. There are a lot of XML tools, but they don't cope with poorly written HTML. I've generalized the code a bit, but need to spend a bit more time UNIX-styling it before publishing. – jldugger May 06 '10 at 21:04
  • :) Cool. Glad it was useful! – Tom O'Connor May 06 '10 at 22:59
1

Write a Perl script with LWP and HTML::TreeBuilder::XPath, perhaps.

xenoterracide
  • Effectively the same as my suggestion. I find BeautifulSoup to be somewhat more semantic than Perl's offering. Depends what the OP prefers I guess! – Tom O'Connor May 05 '10 at 23:12