
Given an XML file with multiple namespaces defined, what is the simplest way to search the DOM for elements that are in the default namespace only, using an XPath query?

As the title suggests, this is using Perl and libXML.

Furthermore, is it possible to do this without hardcoding the namespace? If I use XPathContext to register the namespace, is it possible to query the default namespace of the file?

What I'm trying to achieve:
I'm searching many xlsx spreadsheet documents of different ages for certain formulas and processing these. I was hoping to just use a simple findnodes('//f') to gather all formulas in each sheet. All of the sheets have multiple namespaces defined, but most elements don't carry a namespace prefix. For example:

<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<sheetData>
    <row r="1">
        <c r="A1">
            <f>SUM(1+2)</f>
            <v>3</v>
        </c>
        <c r="A2">
            <f>SUM(4+5)</f>
            <v>9</v>
        </c>
...
<controls>
    <mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
        <mc:Choice Requires="x14">
            <control shapeId="1" r:id="rId4" name="blah">
...

As I mentioned above, I only care about the formulas, i.e. "SUM(1+2)" and "SUM(4+5)" in the example above.

How can I extract just this data out?
The solution doesn't have to be pretty but it does have to always work (I'm not sure if the namespaces change much.)

I could just pipe everything through grep/sed, but was hoping properly parsing it wouldn't be too hard...

maloo
  • Default namespace only exists in the syntax. In the DOM model, each element belongs to a namespace, and there's no way to detect whether it's the default one or not - in fact, a document with a default namespace and a document with the same name space explicitly mentioned at each element are semantically equivalent. Why do you need it? – choroba Nov 23 '18 at 15:27
  • Arh ok, I'll add a bit more context to the question... – maloo Nov 23 '18 at 15:31
  • What does `findnodes('//*[local-name()="f"]')` return? Are there any `f`s in other namespaces you don't want? – choroba Nov 23 '18 at 15:59
  • Thanks @choroba - It doesn't matter for my use case if I do pick up extra elements (even if they are garbage) as I'm processing these further later down the track. Anyway using that XPath syntax worked for me - feel free to stick it in as an answer:) – maloo Nov 23 '18 at 16:12
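
The point in the first comment, that a default namespace and an explicit prefix are equivalent once parsed, can be checked with a small sketch. The two sample strings and the helper name formula_namespace are made up for illustration; only the namespace URI comes from the question.

```perl
use strict;
use warnings;
use XML::LibXML;

my $uri = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';

# The same content written two ways: default namespace vs. explicit prefix.
my $with_default = qq{<worksheet xmlns="$uri"><f>SUM(1+2)</f></worksheet>};
my $with_prefix  = qq{<x:worksheet xmlns:x="$uri"><x:f>SUM(1+2)</x:f></x:worksheet>};

# Return the namespace URI of the first element whose local name is "f".
# (Hypothetical helper, not part of XML::LibXML.)
sub formula_namespace {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my ($f) = $doc->findnodes('//*[local-name()="f"]');
    return $f->namespaceURI;
}

# Both documents report the same namespace URI for their <f> element.
print formula_namespace($_), "\n" for $with_default, $with_prefix;
```

This is also why a plain findnodes('//f') finds nothing in these sheets: in XPath 1.0 an unprefixed name only matches elements in no namespace.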

2 Answers


You can ignore the namespaces completely with local-name():

...->findnodes('//*[local-name()="f"]')

Note that in general, it's not the best idea. E.g., if the syntax of the formulas depended on the version and you needed to normalize them, you would search for formulas in each namespace separately and run different conversions based on the namespace.
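
As a rough sketch of the per-namespace approach described above: collect every f element, then pick a handler based on the element's namespace URI. The %handler_for table, the pass-through handler and the function name formulas_by_namespace are assumptions for illustration; a real normalization routine would replace the handler.

```perl
use strict;
use warnings;
use XML::LibXML;

my $MAIN_NS = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';

# Hypothetical handlers keyed by namespace URI. This one returns the formula
# unchanged; version-specific conversions would go here.
my %handler_for = (
    $MAIN_NS => sub { my ($text) = @_; return $text },
);

# Group formula texts by namespace, skipping namespaces without a handler.
sub formulas_by_namespace {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my %found;
    for my $f ($doc->findnodes('//*[local-name()="f"]')) {
        my $ns      = $f->namespaceURI // '';
        my $handler = $handler_for{$ns} or next;
        push @{ $found{$ns} }, $handler->($f->textContent);
    }
    return \%found;
}
```

Elements named f in a namespace with no registered handler are dropped rather than mixed in with the spreadsheet formulas.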

choroba
  • Thanks choroba - I can see why this isn't the ideal solution, but in my use case I'm just searching to see how many times a particular function is used in Excel spreadsheets (and it will only need to be run once...) – maloo Nov 23 '18 at 16:22
  • It seems this might come in handy when the same DTD might be addressed as http.. or https... – Dan Jacobson Mar 25 '23 at 06:25
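
For the one-off counting use case mentioned in the comments, a minimal sketch could look like the following. The function name count_function_uses and the regex are assumptions, not something from the thread; it counts formulas that call the function, not individual calls within one formula.

```perl
use strict;
use warnings;
use XML::LibXML;

# Count the <f> elements (in any namespace) whose formula calls $function.
# (Hypothetical helper for illustration.)
sub count_function_uses {
    my ($xml, $function) = @_;
    my $doc   = XML::LibXML->load_xml(string => $xml);
    my $count = 0;
    for my $f ($doc->findnodes('//*[local-name()="f"]')) {
        # \b stops SUM from matching inside DSUM; requiring "(" right after
        # the name stops it from matching SUMIF.
        $count++ if $f->textContent =~ /\b\Q$function\E\s*\(/i;
    }
    return $count;
}
```

Usage would be along the lines of count_function_uses($xml, 'SUM'), called once per sheet with that sheet's XML in $xml.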

There's no such thing as the default namespace of a document. The default can be different from element to element. What you're actually asking for is the namespace of the root element. You'd want to do this to support a few "similar enough" formats, and it's done as follows:

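use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# $xml holds the worksheet XML as a string.
my $doc = XML::LibXML->new->parse_string($xml);

# Namespace URI of the root element (<worksheet> in the question's sample).
my $root_ns = $doc->documentElement->namespaceURI;

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( xl => $root_ns );

my @formulas = $xpc->findnodes('//xl:f', $doc);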

But you didn't present any reason not to use the known namespace. You should simply use the following:

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# $xml holds the worksheet XML as a string.
my $doc = XML::LibXML->new->parse_string($xml);

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( xl => 'http://schemas.openxmlformats.org/spreadsheetml/2006/main' );

my @formulas = $xpc->findnodes('//xl:f', $doc);
ikegami