I need to extract the HTML between two elements. For example:

<html>
<head> </head>
<body>
<div>
<span id="start">                
<span> Some text </span>
<span> some other text</span> 
</span>
<span id="parent">                
<span id="target"> target node </span>
<span> some other text</span> 
</span>
</div>
</body>
</html>

Now I want to extract the HTML content starting from the span with id "start" up to and including the span with id "target".

Expected result:

<span id="start">                
<span> Some text </span>
<span> some other text</span> 
</span>
<span>                
<span id="target"> target node </span>

I am using the tree parsing method, and I was able to load the HTML like this:

htmlDocPtr xhtmlDoc = htmlReadFile(fileName.c_str(), "UTF-8",
                                   HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

htmlNodePtr rootNodePtr = xmlDocGetRootElement(xhtmlDoc);

Then I walked to the required node and dumped it with:

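xmlBufferPtr nodeBuffer = xmlBufferCreate();
xmlNodeDump(nodeBuffer, xhtmlDoc, cur_node, 0, 1);  /* serializes cur_node and its whole subtree */
printf("%s\n", (const char *)xmlBufferContent(nodeBuffer));
xmlBufferFree(nodeBuffer);  /* release the buffer once printed */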

Note: cur_node is of type xmlNode *

The problem is that when I reach the span with id "parent" and dump it, the whole content of that span is included, so I get:

<span id="start">                
<span> Some text </span>
<span> some other text</span> 
</span>
<span id="parent">                
<span id="target"> target node </span>
<span> some other text</span> 
</span>

That is, the output contains extra content after the target. How can I achieve the intended result?

Any
  • We need to see more code. And no, it doesn't give "whole HTML content"; that's simply printing the `` element, which *contains* your target element and other elements. – Nicol Bolas Sep 12 '12 at 05:04
  • You just have to walk the tree in document order printing stuff as you go, until you hit the place you want to stop. This means a recursive function. Is that the bit you are struggling with? – john Sep 12 '12 at 05:37
  • @john I am doing that. But xmlNodeDump, or for that matter any dump function provided by libXML, prints the children as well. I do not want all the children: as shown in the example above, I want only some of them, and I also cannot skip the parent. – Any Sep 12 '12 at 07:20
  • OK so don't use the dump functions. Just get the element names, attribute names, attribute values, text node values or whatever and print them out. – john Sep 12 '12 at 07:37
  • You are correct John. I have already started doing that and code is almost complete. I will definitely share it for others. – Any Sep 13 '12 at 07:04
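
The approach suggested in the comments (walk the tree in document order and print each piece yourself instead of calling a dump function) could look roughly like the sketch below. It is not the asker's final code, which was never posted. To keep it self-contained it uses a hypothetical stand-in `Node` struct rather than libxml2 itself; with libxml2 the same walk would read `name`, `content`, `properties`, and `children`/`next` from `xmlNode`. The names `Node`, `emitUntil`, `extractRange`, and `sampleTree` are made up for illustration. It assumes the target lies inside the start node or one of its following siblings, and it does not escape special characters in text or attribute values.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for libxml2's xmlNode: only the pieces the walk needs.
struct Node {
    std::string name;                                        // element name; empty for a text node
    std::string text;                                        // content of a text node
    std::vector<std::pair<std::string, std::string>> attrs;  // attribute name/value pairs
    std::vector<Node> children;                              // child nodes in document order
};

// Writes `node` in document order. Once `target` has been written in full,
// sets `done` and returns without closing any ancestor that is still open.
void emitUntil(const Node& node, const Node* target, std::ostream& out, bool& done) {
    if (node.name.empty()) {                 // text node: no tags, no children
        out << node.text;
        if (&node == target) done = true;
        return;
    }
    out << '<' << node.name;
    for (const auto& a : node.attrs) out << ' ' << a.first << "=\"" << a.second << '"';
    out << '>';
    for (const Node& child : node.children) {
        emitUntil(child, target, out, done);
        if (done) return;                    // target was below: leave this element open
    }
    out << "</" << node.name << '>';
    if (&node == target) done = true;        // target itself is closed before stopping
}

// Serializes parent.children[startIndex] and its following siblings,
// stopping right after `target` has been written.
std::string extractRange(const Node& parent, std::size_t startIndex, const Node* target) {
    std::ostringstream out;
    bool done = false;
    for (std::size_t i = startIndex; i < parent.children.size() && !done; ++i)
        emitUntil(parent.children[i], target, out, done);
    return out.str();
}

// Builds the <div> from the question (whitespace between tags omitted).
Node sampleTree() {
    return Node{"div", "", {}, {
        Node{"span", "", {{"id", "start"}}, {
            Node{"span", "", {}, {Node{"", " Some text ", {}, {}}}},
            Node{"span", "", {}, {Node{"", " some other text", {}, {}}}}}},
        Node{"span", "", {{"id", "parent"}}, {
            Node{"span", "", {{"id", "target"}}, {Node{"", " target node ", {}, {}}}},
            Node{"span", "", {}, {Node{"", " some other text", {}, {}}}}}}}};
}
```

With `Node div = sampleTree();` and `const Node* target = &div.children[1].children[0];`, calling `extractRange(div, 0, target)` writes the "start" span in full, then the opening tag of the "parent" span, then the "target" span, and stops there. The parent is left unclosed on purpose, matching the expected result in the question; if well-formed output is needed, the still-open ancestors would have to be closed after the walk.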
