I'm writing an Android app in Java. The app emulates flashcards, with questions on one side and answers on the other.
I am currently reading a well-formed (I believe) XML document, produced by a Qt-based program that has no problem reading its own output back in, using the following (fairly standard) code:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    try
    {
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document dom = builder.parse(new File(diskLocation));
        Element pack = dom.getDocumentElement();
        NodeList flashCards = pack.getElementsByTagName("flashcard");
        for (int i=0; i < flashCards.getLength(); i++)
        {
            FlashCard flashCard = new FlashCard();

            Node cardNode = flashCards.item(i);
            NodeList cardProperties = cardNode.getChildNodes();
            for (int j=0;j<cardProperties.getLength();j++)
            {
                Node cardProperty = cardProperties.item(j);
                String propertyName = cardProperty.getNodeName();
                if (propertyName.equalsIgnoreCase("Question"))
                {
                    flashCard.setQuestion(cardProperty.getFirstChild().getNodeValue());
                }
                else if (propertyName.equalsIgnoreCase("Answer"))
                {
                    flashCard.setAnswer(cardProperty.getFirstChild().getNodeValue());
                }
                else if
                    ...etc.

Here is a flashcard for learning XML:

 <flashcard>
  <Question>What is the entity reference for ' " '?</Question>
  <Answer>&amp;quot;</Answer>
  <Info></Info>
  <Hint></Hint>
  <KnownLevel>1</KnownLevel>
  <LastCorrect>1</LastCorrect>
  <CurrentStreak>4</CurrentStreak>
  <LevelUp>4</LevelUp>
  <AnswerTime>0</AnswerTime>
 </flashcard>

As I understand the standard, '<' and '&' need to be escaped ('>' probably should be), but quotes and apostrophes don't (unless they're in attributes). Yet when the question and answer for this card are parsed, they come out as `What is the entity reference for '` and `&` respectively.

The input seems to follow the standard. Is the Java XML DOM implementation really not standards-compliant, or am I missing something?

I find it very difficult to believe I'm the only one to have (had) this problem, yet I've searched both Google and Stack Overflow and found surprisingly little of direct relevance.

Thank you for any help!

Rob

Edit: I've just realised the file has a !DOCTYPE, but doesn't start with an <?xml ...?> declaration.
I wonder if this makes any difference.

hakre
M_M
  • Do you have *multiple* node values representing the text (i.e. below your property elements) ? – Brian Agnew Aug 20 '12 at 15:20
  • I just tried your xml and it is working fine producing `"` and `What is the entity reference for ' " '?`. Can you post your complete xml file and maybe any pre-processing you're doing on it? – Daniel Moses Aug 20 '12 at 15:42
  • @DMoses: Thanks for questioning the doctype, I'm looking at it suspiciously, and think it may lead to a solution... I will post either my solution or the full file if I cannot solve it. – M_M Aug 20 '12 at 15:51
  • Turns out it wasn't the doctype, but rather that the string seems to be split into multiple child nodes (as I found in the debugger). For example, there are two child nodes for the answer: `&` and `quot;`. In another case, I found the full string at index 12 in an array, preceded by 12 null elements. So it would seem just using `getFirstChild()` is the problem, and I need to concatenate all child nodes. I can't use `getTextContent()` as that requires API level 8, and I'm supporting 7... Does anyone know of an easier way? – M_M Aug 20 '12 at 16:16
  • @BrianAgnew: Yes, yes I do! I have no idea why (perhaps I need to get to know XMLDom a little better). I have worked around the problem like so: `NodeList childNodes = cardProperty.getChildNodes(); String propertyValue = new String(); for (int k = 0; k < childNodes.getLength(); ++k) { if (childNodes.item(k).getNodeValue() != null) propertyValue += childNodes.item(k).getNodeValue(); }` and then replacing each instance of `cardProperty.getFirstChild().getNodeValue()` with `propertyValue`. Do you know an easier way? – M_M Aug 20 '12 at 16:50
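
The workaround described in the last comment can be pulled into a small helper. What follows is only a sketch under assumptions: the class name `CardText` and the methods `textOf`, `property`, and `parseRoot` are hypothetical and not part of the asker's code, and the sketch was written against the standard `org.w3c.dom` API rather than tested on an API level 7 device. It joins every text and CDATA child of an element instead of reading only the first one, and it returns an empty string for an empty element such as `<Info></Info>`, where `getFirstChild()` returns null.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CardText {

    // Hypothetical helper: concatenates all text and CDATA children of a node.
    // An element with no children yields "" rather than a NullPointerException.
    public static String textOf(Node element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int k = 0; k < children.getLength(); k++) {
            Node child = children.item(k);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: text of the first descendant element with the given tag name.
    public static String property(Element card, String tagName) {
        Node node = card.getElementsByTagName(tagName).item(0);
        return node == null ? "" : textOf(node);
    }

    // Demo-only helper: parses an XML string and returns its root element.
    public static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element card = parseRoot(
                "<flashcard>"
                + "<Question>What is the entity reference for ' \" '?</Question>"
                + "<Answer>&amp;quot;</Answer>"
                + "<Info></Info>"
                + "</flashcard>");
        System.out.println("Question: " + property(card, "Question"));
        System.out.println("Answer:   " + property(card, "Answer"));
        System.out.println("Info:     [" + property(card, "Info") + "]");
    }
}
```

In the loop from the question, each `cardProperty.getFirstChild().getNodeValue()` call would become `CardText.textOf(cardProperty)`. Calling `normalize()` on the document before walking it may also merge adjacent text nodes, but whether it does so around entity references on older Android parsers is an assumption that would need checking.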

1 Answer

From the standard:

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup

which means that either ' or " MUST be escaped in the content of elements.

LJ2
  • That is wrong according to that same standard. < and & are the only start-delimiters within that scope, and all parsers I know of handle it as such (source: http://www.w3.org/TR/2000/REC-xml-20001006) – Daniel Moses Aug 20 '12 at 15:32
  • The data in the element does not contain any unescaped start-delimiters. – M_M Aug 20 '12 at 15:44
  • Shoot, you're right! The grammar clearly specifies such, beyond the ambiguous explanation. Sorry for the bad advice. – LJ2 Aug 20 '12 at 16:06
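
The rule the comments settle on (only `<` and `&` must be escaped in element content, quotes and apostrophes need not be) can be checked against a parser directly. The sketch below is illustrative only: `EscapeCheck` and `isWellFormed` are hypothetical names, and it assumes a standard JAXP `DocumentBuilder` such as the one shipped with a desktop JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EscapeCheck {

    // Hypothetical helper: true if the parser accepts the string as well-formed XML.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input as not well-formed.
            return false;
        } catch (Exception e) {
            // Configuration or I/O failure, unrelated to the input's well-formedness.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unescaped quotes and apostrophes in element content.
        System.out.println(isWellFormed("<a>' \" '</a>"));
        // Unescaped '<' in element content.
        System.out.println(isWellFormed("<a>1 < 2</a>"));
        // Unescaped '&' in element content.
        System.out.println(isWellFormed("<a>R & D</a>"));
        // The same characters written as entity references.
        System.out.println(isWellFormed("<a>1 &lt; 2, R &amp; D</a>"));
    }
}
```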