Get HTML between 2 elements using libXML2

Question

I have to extract the HTML between two elements, e.g.:

<html>
<head> </head>
<body>
<div>
<span id="start">                
<span> Some text </span>
<span> some other text</span> 
</span>
<span id="parent">                
<span id="target"> target node </span>
<span> some other text</span> 
</span>
</div>
</body>
</html>

Now I want to extract the HTML content starting at the span with id "start" and ending with the span with id "target" (inclusive).

Expected result:

<span id="start">                
<span> Some text </span>
<span> some other text</span> 
</span>
<span>                
<span id="target"> target node </span>

I am using the tree parsing method. I load the document with:

htmlDocPtr xhtmlDoc = htmlReadFile(fileName.c_str(), "UTF-8", HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

htmlNodePtr rootNodePtr = xmlDocGetRootElement(xhtmlDoc);

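Then I walked the tree to the required node and dumped it with:

xmlBufferPtr nodeBuffer = xmlBufferCreate();
// Serializes cur_node together with its entire subtree
xmlNodeDump(nodeBuffer, xhtmlDoc, cur_node, 0, 1);
printf("%s\n", (const char *)xmlBufferContent(nodeBuffer));
xmlBufferFree(nodeBuffer);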

Note: cur_node is of type xmlNode *

The problem is that when I reach the span with id "parent" and dump it, I get all of its content, so the overall output is:

<span id="start">                
<span> Some text </span>
<span> some other text</span> 
</span>
<span id="parent">                
<span id="target"> target node </span>
<span> some other text</span> 
</span>

This contains extra content after the target node. How can I achieve the intended result?

We need to see more code. And no, it doesn't give "whole HTML content"; that's simply printing the `` element, which *contains* your target element and other elements. — Nicol Bolas, Sep 12 '12 at 05:04
You just have to walk the tree in document order printing stuff as you go, until you hit the place you want to stop. This means a recursive function. Is that the bit you are struggling with? — john, Sep 12 '12 at 05:37
@john I am doing that. But xmlNodeDump, or for that matter any dump function provided by libxml2, prints the children as well. I do not want all the children: as shown in the example above, I want only some of them, and I also cannot skip the parent. — Any, Sep 12 '12 at 07:20
OK so don't use the dump functions. Just get the element names, attribute names, attribute values, text node values or whatever and print them out. — john, Sep 12 '12 at 07:37
You are correct John. I have already started doing that and code is almost complete. I will definitely share it for others. — Any, Sep 13 '12 at 07:04
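
The approach suggested in the comments (walk the tree in document order and print tags, attributes, and text by hand instead of calling a dump function) could look roughly like the sketch below. This is not the asker's code, which was never posted. To keep it compilable on its own it uses a hypothetical `Node` struct rather than libxml2's `xmlNode`; with libxml2 the same fields would come from `name`, `properties`, `children`/`next`, and `content`. The function names `openTag`, `dump`, `emitUntil`, and `extractBetween` are made up for illustration, and the sketch assumes the start node is a direct child of a common ancestor (`container`) of both nodes.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for xmlNode: an element if name is non-empty,
// otherwise a text node whose content is in `text`.
struct Node {
    std::string name;
    std::string text;
    std::vector<std::pair<std::string, std::string>> attrs;
    std::vector<Node> children;
};

// Writes only the opening tag, including attributes.
void openTag(const Node& n, std::string& out) {
    out += "<" + n.name;
    for (const auto& a : n.attrs)
        out += " " + a.first + "=\"" + a.second + "\"";
    out += ">";
}

// Full serialization of a node and its subtree (what a dump function does).
void dump(const Node& n, std::string& out) {
    if (n.name.empty()) { out += n.text; return; }
    openTag(n, out);
    for (const Node& c : n.children) dump(c, out);
    out += "</" + n.name + ">";
}

// Serializes n in document order. Returns true as soon as the target has
// been written; ancestors of the target are then left unclosed on purpose.
bool emitUntil(const Node& n, const Node* target, std::string& out) {
    if (&n == target) { dump(n, out); return true; }
    if (n.name.empty()) { out += n.text; return false; }
    openTag(n, out);
    for (const Node& c : n.children)
        if (emitUntil(c, target, out)) return true;
    out += "</" + n.name + ">";
    return false;
}

// Assumes `start` is a direct child of `container` and `target` lies in
// `start` or in one of its following siblings.
std::string extractBetween(const Node& container, const Node* start,
                           const Node* target) {
    std::string out;
    bool started = false;
    for (const Node& c : container.children) {
        if (&c == start) started = true;
        if (started && emitUntil(c, target, out)) break;
    }
    return out;
}
```

Calling `extractBetween(div, start, target)` on a tree shaped like the example emits the "start" span in full, then the opening tag of the "parent" span, then the "target" span, and stops there. Note that the opening tag of "parent" keeps its `id` attribute, and that a real implementation would also need to escape text and attribute values.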
