RapidXML: "expected <" error at end-of-file related to whitespace bug?

Question

I created a C++ application that reads XML files with the RapidXML parser. For one XML file, which has exactly the same structure as another one that parses fine, the parser threw an error:

"expected <"

The last five characters before the error were from the closing tag of the root element, so the error happened at the end-of-file:

</UW>

I suspect this error is related to a whitespace-skipping bug that was reported for RapidXML v1.12 (I am using v1.13). I used no parsing flags (doc.parse<0>(bfr);).

According to this site, the bug was believed to be caused by faulty implementation of the "parse_trim_whitespace" parse flag. A patch was provided on that site, but there also seemed to be a problem with that patch.

The following is the XML document that caused this error. What I also don't understand, besides the reason for the error, is why it didn't happen when parsing another file with the same kind of content. My application also successfully parses several other files before reaching this one.

<?xml version="1.0" encoding="UTF-8"?>
<UW>
    <Bez>EV005</Bez>
    <Herst>Trumpf</Herst>
    <Gesw>16</Gesw>
    <Rad>1.6</Rad>
    <Hoehe>100</Hoehe>
    <Wkl>30</Wkl>
    <BgVerf>Freibiegen</BgVerf>
    <MaxBel>50</MaxBel>
    <Kontur>0</Kontur>
    <Grafik>0</Grafik>
</UW>

Part of my application where the error occurs (this is the inside of a loop):

    // Get "Bezeichnung" attribute
    attr = subnode->first_attribute("Bezeichnung");
    if ( !attr ){   err(ERR_FILE_INVALID,"Werkzeuge.xml");  return 0; }
    name = attr->value();
    // Get file name/URL
    string fileName = name;
    fileName.append(".xml");
    // Open file
    ifstream werkzeugFile(concatURL(PFAD_WERKZEUGE,fileName));
    if(!werkzeugFile.is_open()) {   err(ERR_FILE_NOTFOUND,fileName);    return 0;   }
    // Get length
    werkzeugFile.seekg(0,werkzeugFile.end);
    int len = werkzeugFile.tellg();
    werkzeugFile.seekg(0,werkzeugFile.beg);
    // Allocate buffer
    char * bfr = new char [len+1];
    werkzeugFile.read(bfr,len);
    werkzeugFile.close();
    // Parse
    SetWindowText(hwndProgress,"Parsing data: Werkzeuge/*.xml");
    btmDoc.parse<0>(bfr);

    // Get type of tool & check validity
    xml_node<> *rt_node = btmDoc.first_node();
    if ( strcmp(rt_node->name(),"OW") == 0 ){
        isOW = true;
    }
    else if ( strcmp(rt_node->name(),"UW") == 0 ){
        isUW = true;
    }
    else {  err(ERR_FILE_INVALID,fileName); return 0;   }

    // Prepare for next loop iteration
    delete[] bfr;
    btmDoc.clear();
    subnode = subnode->next_sibling();

Open the files in a decent text editor and see (or, perhaps better, use a hex editor). For example, with Notepad++, the encoding (UTF-8, ANSI, ANSI as UTF-8, etc.) is displayed in the bottom right, on the status bar. Files with multi-byte encodings usually have a BOM, except UTF-8 files, which can have one but most often don't (because it's impossible to mistake the byte order in UTF-8), and many parsers suck with regard to non-ASCII text and don't handle all the possible cases. — Cameron, May 26 '14 at 19:23
Yeah, it says "ANSI as UTF-8". The RapidXML manual says: "UTF-8 is fully supported, including all numeric character references, which are expanded into appropriate UTF-8 byte sequences". So that doesn't seem to be the problem. — Sam, May 26 '14 at 19:32
Well, one more thing ruled out. Both files are bytewise identical, then? — Cameron, May 26 '14 at 19:38
Yes, the headers and endings are the same. Another thing: I use a dynamically allocated char array as the buffer, but that variable is inside the loop that goes through the files and I delete[] it after every iteration (file parsing). So that shouldn't be a problem either, since it's local scope, is it? — Sam, May 26 '14 at 19:47
Added the part. This is about press brake tools, "Werkzeug" = tool in German, "U/O" = "unter-/ober-" = "lower/upper". — Sam, May 26 '14 at 19:57

Cameron · Accepted Answer · 2014-05-26T20:13:18.097

Ah, I think I see it. Two things:

First, the ifstream is suspicious -- shouldn't it be opened in binary mode if you're jumping around in it using byte offsets (and somebody else is doing the parsing)? Pass std::ios::in | std::ios::binary as the second argument to the ifstream constructor.

Second, your memory management seems fine, except that you allocate one byte extra (the +1) but never seem to make use of it. I'm assuming you're missing bfr[len] = '\0'; after the contents are read in -- this explains the odd parse error at the end of the file, since the XML parser doesn't know it reached the end of the file -- it's parsing a null terminated string that isn't null terminated, and tries to parse random bytes of memory ;-)

edited May 26 '14 at 20:13

answered May 26 '14 at 20:04

Cameron


Didn't you mean "bfr[len] = '\0';"? Since "bfr" is the buffer that the parser gets in contact with. Also, why would all of this not work for this file, while it works for several others? – Sam May 26 '14 at 20:10
Whoops, yes I do. And it would work some of the time, because the parser would stop at the first null byte it sees -- and generally that's not too far off (any integer variable with a smallish value, for example, will contain a zero byte). – Cameron May 26 '14 at 20:13
YEAH - did it! I changed the buffer allocating to "char * bfr = new char [len];" and "bfr[len] = '\0';". ifstream constructor fix done too. Now it works. – Sam May 26 '14 at 20:15
@Sam: Ah wait, don't get rid of the `+1` -- that's where you're putting the null byte. You don't want to overwrite somebody else's memory... – Cameron May 26 '14 at 20:17
Allright, so imma do all of the file parsings that way ... sweet. Thanks a lot @Cameron! – Sam May 26 '14 at 20:19
@Sam: No problem :-) It's worth being extra careful about memory, otherwise weird intermittent bugs like this will come back to haunt you. – Cameron May 26 '14 at 20:23
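
The "works some of the time" effect from the comments can be imitated in a well-defined way. The sketch below does not read out of bounds; it builds a larger array by hand in which the file content is followed by a chosen number of stray bytes before the first zero. The function name scannedLength and the filler byte '!' are made up for this illustration, and std::strlen stands in for a parser that scans until it hits '\0'.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical demo: returns how many bytes a null-terminated scan consumes
// when `strayBytes` leftover bytes follow the content "</UW>" before the
// first zero byte. strayBytes must be at most 10 to fit the array.
std::size_t scannedLength(std::size_t strayBytes) {
    char memory[16] = {};                       // every byte starts as '\0'
    const char content[] = "</UW>";
    const std::size_t contentLen = sizeof(content) - 1;  // 5, without '\0'

    std::memcpy(memory, content, contentLen);   // copy content only
    // Stand-in for whatever happened to sit in memory after the buffer.
    std::memset(memory + contentLen, '!', strayBytes);

    // A scan for '\0' stops only after the stray bytes.
    return std::strlen(memory);
}
```

With scannedLength(0) the scan ends right after the closing tag, which corresponds to the files that happened to parse. With scannedLength(2) it runs two bytes further, and a parser would see "!!" after </UW>, which is the kind of input that produces "expected <".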
