How to get content in a header using XPath

Question

I'm extracting content from a web page using Yahoo Pipes. For some reason, the developer placed the article content within <h2> tags and I'm having difficulty getting the content from there.

The content looks like this:

<div id="divid"><h2>
<p>Some content</p>
<p>Some more content</p>
</h2>
<!-- some more stuff here -->
</div>

When I use //div[@id='divid'] I can fetch the content of the whole <div> block, but when I try //div[@id='divid']//h2 or //div[@id='divid']//h2/text() I get nothing.

What am I doing wrong and how can I get the content between the <h2> tags correctly?

You may want to check the actual web page.

score 1 · Accepted Answer · answered Sep 13 '13 at 14:04

Maybe you just needed to tick the "Use HTML5 parser" option. Without it, the XPath could not match //h2.

That page is quite a piece of work. The text is full of <span...> tags with inline styles. I created a sample pipe to make some sense out of the page:

http://pipes.yahoo.com/pipes/pipe.info?_id=cf46006f77bdac4a6e57785c78cd0b2b
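
A side note on the //h2/text() attempt: even once //h2 matches, text() only selects text nodes that are direct children of the <h2>, and in the sample markup those are just line breaks, because the real text sits inside the <p> children. Here is a minimal sketch of that behaviour outside of Pipes. It assumes a well-formed copy of the sample markup and uses Python's standard-library xml.etree.ElementTree, which implements only a subset of XPath and is not what Pipes runs internally.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample markup from the question (an assumption:
# the real page is messier and needs an HTML parser, not an XML one).
SAMPLE = """<div id="divid"><h2>
<p>Some content</p>
<p>Some more content</p>
</h2>
<!-- some more stuff here -->
</div>"""

root = ET.fromstring(SAMPLE)  # root is the <div id="divid"> element

# ElementTree supports a limited XPath dialect; './/h2' finds a descendant <h2>.
h2 = root.find(".//h2")

# Text nodes that are direct children of <h2>, i.e. what h2/text() selects:
# the text before the first child plus the text trailing each child.
direct_nodes = [h2.text or ""] + [child.tail or "" for child in h2]
direct_text = "".join(direct_nodes).strip()

# Text of each <p> child, comparable to selecting h2/p.
paragraphs = [p.text for p in h2.findall("p")]

# All descendant text, with whitespace runs collapsed to single spaces.
all_text = " ".join("".join(h2.itertext()).split())

print(repr(direct_text))
print(paragraphs)
print(all_text)
```

The direct text comes out empty, while the per-paragraph and descendant text contain the article content. In XPath terms that suggests selecting the <p> children, for example //div[@id='divid']//h2/p, rather than h2/text(). Treat that expression as an untested suggestion for Pipes.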

janos


Yes, it was the HTML5 parser option. Thanks! – some user Sep 13 '13 at 14:05
