I'm learning about libexpat. I cobbled together this example to get basic familiarity with the API:

The Code:

#include <stdio.h>
#include <expat.h>
#include <string.h>
#include <iostream>

void start(void* userData, const char* name, const char* argv[])
{
  std::cout << "name: " << name << std::endl;

  int i = 0;

  while (argv[i])
  {
    std::cout << "argv[" << i << "] == " << argv[i++] << std::endl;
  }
}

void end(void* userData, const char* name)
{
}

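void value(void* userData, const char* val, int len)
{
  // val is not NUL-terminated; write exactly len bytes instead of copying
  // into a variable-length array, which is not standard C++.
  std::cout << "value: ";
  std::cout.write(val, len);
  std::cout << std::endl;
}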

int main(int argc, char* argv[], char* envz[])
{
  XML_Parser parser = XML_ParserCreate(NULL);
  XML_SetElementHandler(parser, start, end);
  XML_SetCharacterDataHandler(parser, value);

  int bytesRead = 0;
  char val[1024] = {};
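  FILE* fp = fopen("./catalog.xml", "r");
  std::cout << "fp == 0x" << (void*)fp << std::endl;

  // Bail out early if the file could not be opened; fread() on NULL is undefined.
  if (!fp)
  {
    std::cerr << "could not open ./catalog.xml" << std::endl;
    XML_ParserFree(parser);
    return 1;
  }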

  do
  {
    bytesRead = fread(val, 1, sizeof(val), fp);
    std::cout << "In while loop bytesRead==" << bytesRead << std::endl;

    if (0 == XML_Parse(parser, val, bytesRead, (bytesRead < sizeof(val))))
    {
      break;
    }
  }
  while (1);

  XML_ParserFree(parser);
  std::cout << __FUNCTION__ << " end" << std::endl;

  return 0;
}

catalog.xml:

<CATALOG>
    <CD key1="value1" key2="value2">
        <TITLE>Empire Burlesque</TITLE>
        <ARTIST>Bob Dylan</ARTIST>
        <YEAR>1995</YEAR>
    </CD>
</CATALOG>

Makefile:

xml: xml.o
        g++ xml.o -lexpat -o xml

xml.o: main.cpp Makefile
        g++ -g -c main.cpp -o xml.o

Output:

fp == 0x0x22beb50
In while loop bytesRead==148
name: CATALOG
value: 

value:     
name: CD
argv[1] == key1
argv[2] == value1
argv[3] == key2
argv[4] == value2
value: 

value: 
name: TITLE
value: Empire Burlesque
value: 

value: 
name: ARTIST
value: Bob Dylan
value: 

value: 
name: YEAR
value: 1995
value: 

value:     
value: 

In while loop bytesRead==0
main end

Question:

From the output, it appears that the callback I installed with XML_SetCharacterDataHandler() gets called twice for the CATALOG, CD, TITLE, and ARTIST XML tags, and then multiple times for the YEAR tag. Can someone explain this behavior? Looking at catalog.xml above, it's not clear to me why there are (or would ever be) multiple values associated with any XML tag.

Thank you.

Citation:

Credit to this site for the basis of the above sample code.

StoneThrow

1 Answer

The expat parser may split text nodes into multiple calls to the character data handler. To properly handle text nodes you must accumulate text over multiple calls and process it when receiving the "end" event for the containing tag.

This is true in general, even across different parsers and different languages -- e.g. the same thing is true in Java.

See for instance http://marcomaggi.github.io/docs/expat.html#using-comm

A common first–time mistake with any of the event–oriented interfaces to an XML parser is to expect all the text contained in an element to be reported by a single call to the character data handler. Expat, like many other XML parsers, reports such data as a sequence of calls; there's no way to know when the end of the sequence is reached until a different callback is made.

Also, from the expat documentation:

A single block of contiguous text free of markup may still result in a sequence of calls to this handler. In other words, if you're searching for a pattern in the text, it may be split across calls to this handler.
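
As a minimal sketch of that accumulation pattern: the three handlers below have the same shape as expat's start, character data, and end callbacks, but to keep the example free of library dependencies they are driven by hand to imitate the parser splitting one text node into two calls. The names `LeafText`, `onStart`, `onText`, `onEnd`, and `simulateSplitDelivery` are made up for illustration; with expat you would register such handlers via XML_SetElementHandler() and XML_SetCharacterDataHandler() and pass the state object via XML_SetUserData(). It only handles text of leaf elements, since the buffer is reset on every start tag.

```cpp
#include <string>
#include <vector>

// Hypothetical state object (not part of expat). With expat it would be
// handed to the callbacks through XML_SetUserData().
struct LeafText {
  std::string buffer;                // text collected since the last start tag
  std::vector<std::string> results;  // one "NAME=text" entry per closed element
};

// Same shape as an expat start-element handler: begin a fresh buffer.
void onStart(void* userData, const char* /*name*/, const char** /*atts*/) {
  static_cast<LeafText*>(userData)->buffer.clear();
}

// Same shape as an expat character-data handler. val is not NUL-terminated,
// so append exactly len bytes. This may be called several times per text node.
void onText(void* userData, const char* val, int len) {
  static_cast<LeafText*>(userData)->buffer.append(val, len);
}

// Same shape as an expat end-element handler: only now is the text complete.
void onEnd(void* userData, const char* name) {
  LeafText* state = static_cast<LeafText*>(userData);
  state->results.push_back(std::string(name) + "=" + state->buffer);
  state->buffer.clear();
}

// Imitates a parser that reports "Empire Burlesque" in two pieces.
std::vector<std::string> simulateSplitDelivery() {
  LeafText state;
  const char* noAtts[] = {nullptr};
  onStart(&state, "TITLE", noAtts);
  onText(&state, "Empire ", 7);
  onText(&state, "Burlesque", 9);
  onEnd(&state, "TITLE");
  return state.results;
}
```

Calling `simulateSplitDelivery()` yields a single entry holding the full title, even though the text arrived in two calls.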

Jim Garrison
  • Do I understand correctly: every time the `start` callback is called, I should make note of the `name` argument. E.g. suppose the start callback was called with `name=="foo"`, then the foo tag's value is actually the accumulation of all the strings passed in to the data handler callback, and that accumulation can be terminated when the end callback is called with `name=="foo"`. – StoneThrow Feb 09 '17 at 00:15
  • I think something's off about my understanding: the start callback for tag "TITLE" gets called before the end callback for "CATALOG" - so I think this means TITLE's value is accumulated until a subsequent call to the start callback with a different name...? – StoneThrow Feb 09 '17 at 00:18
  • _the start callback for tag "TITLE" gets called before the end callback for "CATALOG"_ -- That's true, the end of the `<CATALOG>` element has not occurred yet. It won't occur until the very end of the document, when `</CATALOG>` is encountered. Why are you expecting to see it before the start-`<TITLE>` event? – Jim Garrison Feb 09 '17 at 02:01
  • It makes sense that the start callback for "TITLE" gets called before the end callback for "CATALOG", but then I don't understand how to put your original comment into practice: "the CATALOG tag's value is actually the accumulation of all the strings passed in to the data handler callback, and that accumulation can be terminated when the end callback is called with name=="CATALOG"". But how would the data handler know whether the `val` argument it is passed belongs to CATALOG or TITLE if we received the start callback for TITLE but not yet the end callback for CATALOG? – StoneThrow Feb 09 '17 at 02:22
  • Sorry, I quoted myself instead of you. The quote of yours that I don't understand how to put into practice: "you must accumulate text over multiple calls and process it when receiving the "end" event for the containing tag". But same concern: how would the data handler know whether the `val` argument it is passed belongs to CATALOG or TITLE if we received the start callback for TITLE but not yet the end callback for CATALOG? – StoneThrow Feb 09 '17 at 02:31
  • You know, I think I'm realizing what you mean (still happy to receive your further comment, if any): in the data handler callback, we would need to keep track of which tag's "context" we're inside, by doing something along the lines of reference counting between start tag callbacks we get and end tag callbacks we get. If we use that knowledge of "context", we know which tag the `val` argument passed to the data handler callback belongs to. Is that sort of the idea? – StoneThrow Feb 09 '17 at 02:35
  • Think "stack" to track where you are. – Jim Garrison Feb 09 '17 at 02:39
  • Gotcha. I'm with you now. Thanks for the answer and subsequent explanation. – StoneThrow Feb 09 '17 at 02:40
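
The "stack" idea from the comments above could look roughly like this. It is a sketch under assumptions, not code from the answer: `Frame`, `ParseState`, `pushElement`, `appendText`, `popElement`, and `simulateNestedDelivery` are hypothetical names. The handlers match the shape of expat callbacks but are called by hand here to imitate the event order for a TITLE nested inside CATALOG. Each start event pushes a frame, character data is appended to the innermost open frame, and each end event pops the frame and processes its complete text.

```cpp
#include <string>
#include <vector>

// Hypothetical per-element frame: the element name plus the text seen
// directly inside it so far.
struct Frame {
  std::string name;
  std::string text;
};

// Hypothetical parser state. With expat it would be passed to the callbacks
// through XML_SetUserData().
struct ParseState {
  std::vector<Frame> open;          // stack; back() is the innermost open element
  std::vector<std::string> closed;  // "NAME=text" entries in closing order
};

// Start event: a new element becomes the current context.
void pushElement(void* userData, const char* name, const char** /*atts*/) {
  static_cast<ParseState*>(userData)->open.push_back(Frame{name, ""});
}

// Character data belongs to whichever element is on top of the stack.
void appendText(void* userData, const char* val, int len) {
  ParseState* state = static_cast<ParseState*>(userData);
  if (!state->open.empty()) {
    state->open.back().text.append(val, len);
  }
}

// End event: the top frame's text is complete, so record it and pop.
void popElement(void* userData, const char* /*name*/) {
  ParseState* state = static_cast<ParseState*>(userData);
  if (state->open.empty()) return;
  const Frame& top = state->open.back();
  state->closed.push_back(top.name + "=" + top.text);
  state->open.pop_back();
}

// Imitates the event order for <CATALOG>\n<TITLE>Empire Burlesque</TITLE>\n</CATALOG>,
// with the title text split across two calls.
std::vector<std::string> simulateNestedDelivery() {
  ParseState state;
  const char* noAtts[] = {nullptr};
  pushElement(&state, "CATALOG", noAtts);
  appendText(&state, "\n", 1);   // whitespace directly inside CATALOG
  pushElement(&state, "TITLE", noAtts);
  appendText(&state, "Empire ", 7);
  appendText(&state, "Burlesque", 9);
  popElement(&state, "TITLE");
  appendText(&state, "\n", 1);   // more whitespace, attributed to CATALOG again
  popElement(&state, "CATALOG");
  return state.closed;
}
```

In this sketch TITLE closes first with its full text, and the whitespace-only calls seen in the question's output end up attributed to CATALOG rather than to TITLE.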