I'm using XML::Simple to parse an XML file. The problematic part looks like this:

    <textBody>
        <title>
            <titlePart>
                <text>SECTION A <emdash/> HUMAN NECESSITIES</text>
            </titlePart>
        </title>
    </textBody>
    <ipcEntry kind="t" symbol="A01" ipcLevel="C" entryType="K" lang="EN">
        <textBody>
            <title>
                <titlePart>
                    <text>AGRICULTURE</text>
                </titlePart>
            </title>
        </textBody>
    </ipcEntry>

For some reason XML::Simple completely ignores <text>SECTION A <emdash/> HUMAN NECESSITIES</text>. I guess it's because of the emdash tag, since <text>AGRICULTURE</text> is parsed just fine. I also tried setting the parser with:

    $XML::Simple::PREFERRED_PARSER = 'XML::Parser';

Still no go. Any ideas?

Sinan Ünür
snoofkin
  • [Why is XML::Simple "discouraged"](http://stackoverflow.com/questions/33267765/why-is-xmlsimple-discouraged) – Sobrique Nov 23 '15 at 17:25

2 Answers

Having a tag whose value includes both text and other tags is called "mixed content". XML::Simple doesn't handle mixed content (not usefully, anyway). In XML::Simple's view of the universe, a tag can contain either text or other tags, not both. That's why it's called "Simple". To quote its docs:

> Mixed content (elements which contain both text content and nested elements) will be not be represented in a useful way - element order and significant whitespace will be lost. If you need to work with mixed content, then XML::Simple is not the right tool for your job

You'll have to pick a different XML module. XML::LibXML and XML::Twig are popular choices.
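
As a rough illustration of the XML::LibXML route, the sketch below walks the children of each text element in document order, so the position of the emdash tag is preserved. The helper name `flatten_text_elements` is made up for this example, and replacing emdash with U+2014 is an assumption about what the tag is meant to represent.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: flatten every <text> element into a plain string,
# keeping document order and substituting U+2014 for each <emdash/> child
# (that substitution is an assumption about the tag's meaning).
sub flatten_text_elements {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my @out;
    for my $text ($doc->findnodes('//text')) {
        my $str = '';
        for my $child ($text->childNodes) {
            if ($child->nodeType == XML_TEXT_NODE) {
                # Plain character data between tags
                $str .= $child->nodeValue;
            }
            elsif ($child->nodeType == XML_ELEMENT_NODE
                   and $child->nodeName eq 'emdash') {
                $str .= "\x{2014}";
            }
        }
        push @out, $str;
    }
    return @out;
}

my $xml = <<'XML';
<textBody>
    <title>
        <titlePart>
            <text>SECTION A <emdash/> HUMAN NECESSITIES</text>
        </titlePart>
    </title>
</textBody>
XML

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for flatten_text_elements($xml);
```

Any other child elements inside a text element are silently skipped here; a real document may need more cases.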

Another possibility would be to get whoever produced the XML to use entities instead of tags to represent characters like a dash. For example, XML::Simple could handle:

    <text>SECTION A &#8212; HUMAN NECESSITIES</text>

just fine. (&#8212; is an em dash.)
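
If the producer can't be persuaded, one stopgap (not something the answer above proposes, just a sketch) is to do the tag-to-entity substitution yourself before handing the string to XML::Simple. The helper name `parse_with_emdash_entities` is hypothetical, and the regex assumes the tag always appears as an empty element with no attributes.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical pre-processing step: turn each <emdash/> tag into the
# numeric character reference before XML::Simple sees the document.
# Assumes the tag is always written as an attribute-free empty element.
sub parse_with_emdash_entities {
    my ($xml) = @_;
    (my $fixed = $xml) =~ s{<emdash\s*/>}{&#8212;}g;
    return XMLin($fixed);
}

my $data = parse_with_emdash_entities(<<'XML');
<textBody>
    <title>
        <titlePart>
            <text>SECTION A <emdash/> HUMAN NECESSITIES</text>
        </titlePart>
    </title>
</textBody>
XML

binmode STDOUT, ':encoding(UTF-8)';
# With no mixed content left, <text> collapses to a plain string
print $data->{title}{titlePart}{text}, "\n";
```

Editing XML with a regex is fragile, so this only makes sense for a small, well-known set of character-like tags.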

cjm

XML::Simple is parsing all of it, but it doesn't handle mixed content well. From the fine manual:

> Mixed content (elements which contain both text content and nested elements) will be not be represented in a useful way - element order and significant whitespace will be lost. If you need to work with mixed content, then XML::Simple is not the right tool for your job - check out the next section.

For example, this:

    use strict;
    use warnings;
    use Data::Dumper;
    use XML::Simple;

    # Parse the XML string and dump the resulting Perl data structure
    print Dumper(XMLin(qq{
        <textBody>
            <title>
                <titlePart>
                    <text>SECTION A <emdash/> HUMAN NECESSITIES</text>
                </titlePart>
            </title>
        </textBody>
    }));

Yields:

    $VAR1 = {
        'title' => {
            'titlePart' => {
                'text' => {
                    'emdash' => {},
                    'content' => [
                        'SECTION A ',
                        ' HUMAN NECESSITIES'
                    ]
                }
            }
        }
    };

So the emdash is there but the mixed content is rather mixed up.

mu is too short