0

I'm new to libxml and so far everything is good, but I noticed one thing that annoys me: when libxml reports characters, i.e. when the handler's characters function is called, "special" characters like ' or " are reported individually. Example:

"It's a nice day today. Don't you agree?"
report: "
report: It
report: '
report: s a nice day today. Don
report: '
report: t you agree?
report: "

Is there any way to change that behavior, so it would be reported as a complete string? Don't get me wrong, it's not a problem to use strcat to put the original string together, but that's additional work ;)

I searched the headers and the net and found no solution. Thank you in advance.

Edit: The handler description above needs some more explanation. By "reporting characters" I mean that the characters callback of my handler (an htmlSAXHandler) is called, to which I assigned:

void _characters(void *context, const xmlChar *ch, int len) {
    /* ch is not NUL-terminated; print exactly len bytes */
    printf("report: %.*s\n", len, (const char *)ch);
}
YllierDev

2 Answers

1

You might want to look at DOM parsing instead of registering SAX callbacks, if your document isn't going to be so large that you can't hold it all in memory.

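#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/tree.h>

int main(void)
{
  htmlDocPtr doc;
  xmlNodePtr root, node;
  xmlChar *output;
  const char *rawhtml = "<html><body>\"It's a nice day today.  Don't you agree?\"</body></html>";

  /* htmlReadDoc takes xmlChar*, and the HTML parser uses HTML_PARSE_* options */
  doc = htmlReadDoc((const xmlChar *)rawhtml, NULL, NULL, HTML_PARSE_NOBLANKS);
  if(doc == NULL)
    return 1;
  root = xmlDocGetRootElement(doc);                /* <html> */
  node = root ? root->children : NULL;             /* <body> */
  output = node ? xmlNodeGetContent(node) : NULL;  /* all text below the node, as one string */
  printf("output=[%s]\n", output ? (const char *)output : "");
  if(output)
    xmlFree(output);
  xmlFreeDoc(doc);
  return 0;
}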

produces

output=["It's a nice day today.  Don't you agree?"]
Jason Viers
0

I'm afraid you'll have to live with that. If you encounter an HTML document with 100K characters, do you also expect all of them to be delivered in one go? You should be prepared for the character data to be split at any point. Once you handle that, splitting at special characters makes no difference.

This answer is not adequate if your software aims to read only small HTML documents, but I bet that the libxml authors were not thinking of special handling for such cases.
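
Being "ready for splitting" can be as simple as appending every chunk to a growable buffer and reading it once the text is complete. The following is only a sketch in plain C, not libxml code: text_buf, text_buf_append, on_characters and collect_demo are made-up names, the callback merely mirrors the shape of the characters callback from the question (with unsigned char standing in for xmlChar), and it assumes the context pointer you receive is your own buffer. collect_demo plays the role of the parser and feeds the chunks from the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable text buffer; not part of libxml. */
typedef struct {
    char  *data;  /* NUL-terminated once anything has been appended */
    size_t len;   /* bytes used, excluding the terminator */
    size_t cap;   /* bytes allocated */
} text_buf;

/* Append len bytes of ch; returns 0 on success, -1 if allocation fails. */
static int text_buf_append(text_buf *b, const unsigned char *ch, int len)
{
    assert(b != NULL && ch != NULL);
    if (len <= 0)
        return 0;
    size_t need = b->len + (size_t)len + 1;  /* +1 for the terminator */
    if (need > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < need)
            cap *= 2;
        char *p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, ch, (size_t)len);
    b->len += (size_t)len;
    b->data[b->len] = '\0';
    return 0;
}

/* Same shape as the characters callback: context, chunk, chunk length.
   Assumption: context points to a text_buf owned by the caller. */
static void on_characters(void *context, const unsigned char *ch, int len)
{
    text_buf_append((text_buf *)context, ch, len);
}

/* Stand-in for the parser: delivers the text in the pieces shown in the
   question and returns the reassembled string (caller frees it). */
static char *collect_demo(void)
{
    static const char *chunks[] = {
        "\"", "It", "'", "s a nice day today. Don", "'", "t you agree?", "\""
    };
    text_buf buf = {0};
    for (size_t i = 0; i < sizeof chunks / sizeof chunks[0]; i++)
        on_characters(&buf, (const unsigned char *)chunks[i],
                      (int)strlen(chunks[i]));
    return buf.data;
}
```

In a real handler you would probably reset the buffer when an element starts and use its contents when the element ends. Unlike repeated strcat, the length-based append does not rely on the chunks being NUL-terminated.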

Jarekczek
  • I know of the 1k char limit, however, this is not the case here... well, I guess I'll have to live with strcat :D – YllierDev Nov 01 '12 at 14:03