TinyXML2 C++ - Extracting specific data from old/poorly formatted XML files

Question

I'm looking to search within blocks of XML that are rather old (documents dated 1999) and I'm having a little bit of difficulty getting TinyXML2 to operate as intended. I can grab certain snippets but I have issues when there's an element within another one. Take this sample:

  <SUBJECT><TITLE>Mathematics</TITLE></SUBJECT>
     <AREA><TITLE>Arithmetic</TITLE></AREA>
     <SECTION><TITLE>Whole Numbers</TITLE></SECTION> 
        <TOPIC GRADELEVEL="4"><TITLE>Introduction to Numbers</TITLE></TOPIC> 
          <DESCRIPTION><TITLE>Description</TITLE></DESCRIPTION>  
             <FIELDSPACE>
                <PARA>To represent each conceivable number by means of a separate
                  little picture or number symbol is impossible. Therefore the civilizations of
                  the past all developed a certain pattern whereby they could write down numbers,
                  by making use of a small number of symbols. </PARA>
             </FIELDSPACE> 
             <FIELDSPACE>
                <PARA>Today, we use the Hindu-Arabic system, which first of all is
                  decimal, because we make use of only 10 different symbols, namely,</PARA>
                <LITERALLAYOUT>     0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.</LITERALLAYOUT>
             </FIELDSPACE>
             <FIELDSPACE>
                <PARA>Secondly, a place value applies. This means that if only 1
                  digit is written down then it is that number, such as a 3, a 6, or an 8.</PARA>
             </FIELDSPACE>
             <FIELDSPACE>
                <PARA>Thirdly, only the addition principle is built into our number
                  symbols.</PARA>
                <PARA>In other words,</PARA>
                <LITERALLAYOUT>     135 means 100 + 300 + 5</LITERALLAYOUT>
                <LITERALLAYOUT>     6.3 means 6 + three tenths = 6 + <EQUATION>
<INLINEGRAPHIC FILEREF="Mathematics/Arithmetic/WholeNumbers/IntroductionNumbers/eq.png" />
</EQUATION></LITERALLAYOUT>
                <LITERALLAYOUT>     and two and a quarter = <EQUATION>
<INLINEGRAPHIC FILEREF="Mathematics/Arithmetic/WholeNumbers/IntroductionNumbers/eq2.png" />
</EQUATION></LITERALLAYOUT>
                <PARA>means</PARA>
                <LITERALLAYOUT>     two plus a quarter = <EQUATION>
<INLINEGRAPHIC FILEREF="Mathematics/Arithmetic/WholeNumbers/IntroductionNumbers/eq3.png" />
</EQUATION></LITERALLAYOUT>
             </FIELDSPACE>

Here is what I've written:

    XMLDocument doc;
    Resource::resource_t *f = Resource::Open("IntroductionNumbers.xml"); // File load

        if (!f)
            return;

        doc.Parse((const char*)f->buffer, f->size);
        Resource::Close(f);

        XMLElement *pElem;
        pElem = doc.FirstChildElement();

        if (!pElem)
            return;
        for (pElem = pElem->FirstChildElement(); pElem; pElem = pElem->NextSiblingElement())
        {
            if (!strcmp(pElem->Value(), "SUBJECT"))
            {
                // Print what's in pElem->FirstChildElement("TITLE")->GetText()
                // This works fine.
            }
            else if (!strcmp(pElem->Value(), "AREA"))
            {
                // Print what's in pElem->FirstChildElement("TITLE")->GetText()
                // This works fine.
            }
...
...
...
             else if (!strcmp(pElem->Value(), "TOPIC"))
            {
                 char *temp;
                 temp = msprintf("%s - Section %s", pElem->FirstChildElement("TITLE")->GetText(), pElem->FirstAttribute()->Value());
                // Print what's in temp
                // This still works!
            }
             else if (!strcmp(pElem->Value(), "FIELDSPACE"))
            {
                // I can print PARA or FIELDSPACE, but I can't seem to read LITERALLAYOUT, EQUATION, or INLINEGRAPHIC.
            }
        }

I need generic code rather than code specific to this one file: there are hundreds of XML files and I need to write something that will parse all of them. How would I grab the information within LITERALLAYOUT/EQUATION/INLINEGRAPHIC?

Thanks in advance!

You will do better using XPath `xmllint --xpath '//LITERALLAYOUT/EQUATION/INLINEGRAPHIC' test.xml ` — LMC, Feb 02 '18 at 15:07
This : https://stackoverflow.com/questions/43353518/tinyxml2-get-text-from-node-and-all-subnodes/43356508#43356508 might give you some ideas for iterating thru the XML hierarchy and explain why `GetText()` does not do what you want. — stanthomas, Feb 07 '18 at 00:56

score 0 · Answer 1 · edited Mar 06 '18 at 12:07


EQUATION does not have a string value here. It doesn't contain any text in the markup, only a child element, so you won't get anything back from it. You need to look at the attributes on the INLINEGRAPHIC element inside it, e.g. ig->Attribute("FILEREF"), where ig is the pointer to the XMLElement representing the INLINEGRAPHIC element.

edited Mar 06 '18 at 12:07

Andrew Truckle

answered Feb 02 '18 at 18:10

barefootliam

score 0 · Answer 2 · answered Mar 06 '18 at 12:06

Just to build on the previous answer. This is what you have:

<LITERALLAYOUT>xxxxxxxxx
    <EQUATION>
        <INLINEGRAPHIC FILEREF="Mathematics/Arithmetic/WholeNumbers/IntroductionNumbers/eq.png" />
    </EQUATION>
</LITERALLAYOUT>

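You have two things going on here. When you get to LITERALLAYOUT you can use GetText() and that would return xxxxxxxxx, the text that comes before the EQUATION child. It does not return anything from the nested elements.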

But then you have a choice. If you want it generic, you must iterate over all child elements of your LITERALLAYOUT pointer. If you don't want to do that, then you must extract the first child with the name you expect, e.g.:

XMLElement *pLITERALLAYOUT = xxxx; // You get this pointer.

XMLElement *pEQUATION = pLITERALLAYOUT->FirstChildElement("EQUATION");
if (pEQUATION != nullptr)
{
    // Now get the INLINEGRAPHIC element
    XMLElement *pINLINEGRAPHIC = pEQUATION->FirstChildElement("INLINEGRAPHIC");

    if (pINLINEGRAPHIC != nullptr)
    {
        // Attribute() returns nullptr if FILEREF is not present
        const char *FILEREF = pINLINEGRAPHIC->Attribute("FILEREF");
    }
}

See? You have to know the right way to navigate the XML file.
