Should I Use Regex to Find the XML Namespace Definition?

Question

The script below works. It parses an XML document and looks up a particular node under the namespace "dei".

But is relying on a regex for the namespace definition the proper way? (I do not really know XML, so I worry that such a regex is not foolproof for all Edgar XMLs. For example, are such definitions always enclosed in double quotes and preceded by xmlns:?)

Thanks.

use strict;
use warnings;

use LWP::Simple;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';
my $xml = LWP::Simple::get($url);
my $dom = XML::LibXML->load_xml(string => $xml);

my @nsDefs = ($xml =~ /xmlns:dei="(.+?)"/g);
die "Namespace definition must be unique!\n" unless @nsDefs == 1;

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs('dei', $nsDefs[0]);

my @matches = $xpc->findnodes('//dei:TradingSymbol');
print 'Number of matches = ', scalar(@matches), "\n";

Output:

Number of matches = 1

No, they can be in single quotes, and someone could have the weird idea of replacing a / with a character reference such as &#x2F;, for instance. Long story short, you can't parse XML with regexes; they will never do the full job. More importantly, you shouldn't search for a node that contains xmlns:something. That information has no value in itself: there is no reason why the node that declares it should be the one you want, nor why the declaration should be unique in the document. Maybe it is, maybe it's not, and it's none of your business. You shouldn't be looking for it. What you're looking for is something else. – kumesana, Sep 12 '17 at 20:54
Thx Kumesana. What you said is exactly what I feared. But what is the proper way then? My situation: All the XMLs I work with will use a "dei" namespace, which is of interest to me. But different XMLs may have different definitions for "dei". So how am I supposed to know what the definition is (in order to parse it with a DOM)? For example, this XML has a different definition than that in my OP. https://www.sec.gov/Archives/edgar/data/104207/000010420712000098/wag-20120831.xml — Shang Zhang, Sep 12 '17 at 21:06
See the other answer, they understood better than I what you had in mind. — kumesana, Sep 12 '17 at 21:09
Re "*So how am I supposed to know what the definition is*", That's not the right question. Both namespaces/specs could be used in the same doc. The correct question is: Which specs (and thus namespaces) are used by the doc? — ikegami, Sep 13 '17 at 06:21

score 1 · Answer 1 · answered Sep 12 '17 at 21:00

The only important thing about a namespace in XML is the URI. Your code is assuming a namespace prefix of dei, using that to locate the namespace declaration and determine that the URI is http://xbrl.sec.gov/dei/2014-01-31. This is exactly backwards. The thing you should be hard-coding in your script is the URI - it won't change. The namespace prefix is theoretically variable and a different prefix might be used for the same URI in other documents.
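
A minimal sketch of this point, assuming three small hypothetical documents (the element `r`, the value `ABC`, and the prefixes `x` and `d` are made up for illustration). Each declares the same URI in a different way, and a single hard-coded URI finds the element in all of them:

```perl
use strict;
use warnings;

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# The namespace URI is the fixed part; this is the one named in the answer above.
my $DEI_URI = 'http://xbrl.sec.gov/dei/2014-01-31';

# Hypothetical documents: same namespace, declared with different prefixes
# or as a default namespace with no prefix at all.
my %samples = (
    prefix_dei => qq{<r xmlns:dei="$DEI_URI"><dei:TradingSymbol>ABC</dei:TradingSymbol></r>},
    prefix_x   => qq{<r xmlns:x='$DEI_URI'><x:TradingSymbol>ABC</x:TradingSymbol></r>},
    no_prefix  => qq{<r><TradingSymbol xmlns="$DEI_URI">ABC</TradingSymbol></r>},
);

# The prefix "d" is private to this XPath context; it does not need to
# match whatever prefix the document happens to use.
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( d => $DEI_URI );

my %count;
for my $name ( sort keys %samples ) {
    my $doc     = XML::LibXML->load_xml( string => $samples{$name} );
    my @matches = $xpc->findnodes( '//d:TradingSymbol', $doc );
    $count{$name} = scalar @matches;
    print "$name: $count{$name}\n";
}
```

The script never inspects the `xmlns` declarations itself; the parser resolves them and the XPath context compares URIs.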

Grant McLean


Come to think of it, there could also be no prefix at all. – kumesana Sep 12 '17 at 21:03
Grant. Please see my comments above. My actual situation is that I know all the XMLs will hold the information I need under a namespace "dei". But sometimes it is http://xbrl.sec.gov/dei/2014-01-31 (but other times, it could be "http://xbrl.sec.gov/dei/2012-01-31" -- depending on the time the XML was produced). What is the proper thing to do? – Shang Zhang Sep 12 '17 at 21:14
OK, I understand now. @ikegami's solution of registering both URIs and using an XPath query to match either is the way I would do it too. – Grant McLean Sep 13 '17 at 10:12

Miller · Answer 2 · 2017-09-13T13:07:17.703

1

Use getNamespaces() on the document element instead of a regex:

my @ns_dei = grep { $_->name eq 'xmlns:dei' } $dom->documentElement()->getNamespaces();

die "Namespace definition must be unique!\n" if @ns_dei != 1;

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs( 'dei', $ns_dei[0]->value );

edited Sep 13 '17 at 13:07

answered Sep 13 '17 at 03:49

Miller


ikegami · Answer 3 · 2017-09-13T06:22:41.083

dei is not a namespace; it's a prefix that's only meaningful in that particular document. You can't count on the namespace's prefix always being dei.

http://xbrl.sec.gov/dei/2014-01-31 is the namespace. That's the thing that can't change, and that you should be basing your code around.

In a comment, you mentioned you have to deal with multiple specs. Just create an XPath prefix for each spec you support.

use strict;
use warnings;

use LWP::Simple               qw( );
use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';

my $xml = LWP::Simple::get($url);

my $doc = XML::LibXML->load_xml(string => $xml);

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( d1 => 'http://xbrl.sec.gov/dei/2012-01-31' );
$xpc->registerNs( d2 => 'http://xbrl.sec.gov/dei/2014-01-31' );

my @matches = $xpc->findnodes('//d1:TradingSymbol|//d2:TradingSymbol', $doc);
print "Number of matches = ", 0+@matches, "\n";
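
If the list of supported specs grows, the same idea can be driven from a list. This is only a sketch: the `@DEI_URIS` array, the generated prefixes `d0`, `d1`, and so on, and the inline sample document (with the made-up symbol `XYZ`) are assumptions for illustration, not part of the code above.

```perl
use strict;
use warnings;

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# Namespace URIs to support; extend this list as new specs appear.
my @DEI_URIS = (
    'http://xbrl.sec.gov/dei/2012-01-31',
    'http://xbrl.sec.gov/dei/2014-01-31',
);

# Register one generated prefix per URI and build a union XPath from them.
my $xpc = XML::LibXML::XPathContext->new();
my @paths;
for my $i ( 0 .. $#DEI_URIS ) {
    my $prefix = "d$i";
    $xpc->registerNs( $prefix => $DEI_URIS[$i] );
    push @paths, "//${prefix}:TradingSymbol";
}
my $xpath = join '|', @paths;

# Hypothetical document that uses the 2012 namespace.
my $doc = XML::LibXML->load_xml( string =>
    '<r xmlns:dei="http://xbrl.sec.gov/dei/2012-01-31">'
  . '<dei:TradingSymbol>XYZ</dei:TradingSymbol></r>' );

my @matches = $xpc->findnodes( $xpath, $doc );
print "XPath: $xpath\n";
print "Number of matches = ", scalar(@matches), "\n";
```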

Michael Kay · Answer 4 · 2017-09-13T16:28:27.280

0

Never use regular expressions to process XML: your code will always be wrong. Your example has at least five bugs: it will fail to match if a different prefix is used, it will fail to match if single quotes are used, it will fail to match if there is whitespace around the "=" sign, it will error if the namespace declaration is duplicated, and it will give a spurious match if there is "commented out" XML in the source document.

It is theoretically impossible to eliminate these bugs, because regular expressions are not powerful enough to parse XML correctly.

Always use a real XML parser, and XPath.
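
As a rough illustration of three of the failure modes listed above, the sketch below runs the question's pattern against small hypothetical inputs (the element names `r` and `old` and the URI `urn:example:old` are made up) and compares the result with what the parser reports for the same prefix:

```perl
use strict;
use warnings;

use XML::LibXML qw( );

my $URI = 'http://xbrl.sec.gov/dei/2014-01-31';

# Hypothetical inputs; each one is well-formed XML.
my %samples = (
    single_quotes => qq{<r xmlns:dei='$URI'><dei:TradingSymbol>ABC</dei:TradingSymbol></r>},
    spaced_equals => qq{<r xmlns:dei = "$URI"><dei:TradingSymbol>ABC</dei:TradingSymbol></r>},
    commented_out => qq{<r xmlns:dei="$URI"><!-- <old xmlns:dei="urn:example:old"/> --><dei:TradingSymbol>ABC</dei:TradingSymbol></r>},
);

my ( %by_regex, %by_parser );
for my $name ( sort keys %samples ) {
    my $xml = $samples{$name};

    # The pattern from the question: counts textual matches only.
    my @found = ( $xml =~ /xmlns:dei="(.+?)"/g );
    $by_regex{$name} = scalar @found;

    # The parser resolves the prefix on the root element, ignoring comments
    # and accepting either quote style and whitespace around "=".
    my $root = XML::LibXML->load_xml( string => $xml )->documentElement();
    $by_parser{$name} = $root->lookupNamespaceURI('dei');

    print "$name: regex matches = $by_regex{$name}, parser says dei = $by_parser{$name}\n";
}
```

The regex finds nothing in the first two inputs and finds two declarations in the third, while the parser resolves the prefix to the same URI in all three.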

edited Sep 13 '17 at 16:28

answered Sep 12 '17 at 22:46

Michael Kay


Anonymous downvoter: Downvote only wrong, especially harmfully wrong, answers -- not answers that you don't want to hear. This answer is correct; its reasoning, sound. – kjhughes Sep 13 '17 at 00:13
Re "*regular expressions are not powerful enough to parse XML correctly.*", That's not true. XML is trivial to parse using regex. That's not the reason regex are discouraged. They are discouraged because using them to parse XML is reinventing the wheel (XML parser), and it's almost guaranteed to be reinvented really, really poorly. – ikegami Sep 13 '17 at 05:59
@ikegami, you are 100% wrong. XML is not a regular language, because its grammar is recursive. It *cannot* be parsed correctly using regular expressions. – Michael Kay Sep 13 '17 at 08:58
It doesn't have to be a regular language to be parsed using the regular expressions the OP is using. Furthermore, the OP didn't give any indication that they would parse the document using a single match operator. When you're done erecting straw men (by pretending the OP is doing something completely different than they are doing) just so you can lecture them and sound smart, please fix your answer. – ikegami Sep 13 '17 at 15:45
I'll fix my answer when I see a regular expression used to parse XML without any bugs in it. – Michael Kay Sep 13 '17 at 16:24

score 0 · Answer 5 · answered Sep 13 '17 at 10:07

I understand that your problem is that the XML you read will not always use the same namespace URI for the dei: prefix and for the elements you're looking for.

In that case the XML you're stuck with is ill-designed and there is no good practice established for that. This XML is using namespaces wrong and you will need to work around that. For information, changing an element's namespace is by definition changing its name, and therefore the most basic information you're using to find it.

Your best bet is to ignore namespaces altogether. You can do that with

//*[local-name() = "TradingSymbol"]

If the number of different namespaces you can get is limited to a select few, you could instead list them all, as dei: and dei2012: for instance, and select for both:

//dei:TradingSymbol | //dei2012:TradingSymbol
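
A small sketch of the first option, using a hypothetical document (the prefixes `a`, `b`, `o`, the namespace `urn:example:other`, and the values are made up). It also shows the trade-off: ignoring namespaces entirely picks up same-named elements from unrelated namespaces. The second query narrows this with `namespace-uri()`, on the assumption that every namespace of interest starts with `http://xbrl.sec.gov/dei/`, which may not hold for all filings.

```perl
use strict;
use warnings;

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# Hypothetical document mixing two dei namespace versions and an unrelated one.
my $xml = <<'XML';
<r xmlns:a="http://xbrl.sec.gov/dei/2012-01-31"
   xmlns:b="http://xbrl.sec.gov/dei/2014-01-31"
   xmlns:o="urn:example:other">
  <a:TradingSymbol>AAA</a:TradingSymbol>
  <b:TradingSymbol>BBB</b:TradingSymbol>
  <o:TradingSymbol>not-dei</o:TradingSymbol>
</r>
XML

my $doc = XML::LibXML->load_xml( string => $xml );
my $xpc = XML::LibXML::XPathContext->new($doc);

# Match on the local name only: every namespace qualifies.
my @any_ns = map { $_->textContent }
    $xpc->findnodes('//*[local-name() = "TradingSymbol"]');

# Same, but keep only elements whose namespace URI has the assumed prefix.
my @dei_only = map { $_->textContent } $xpc->findnodes(
    '//*[local-name() = "TradingSymbol"'
  . ' and starts-with(namespace-uri(), "http://xbrl.sec.gov/dei/")]'
);

print "any namespace: @any_ns\n";
print "dei only:      @dei_only\n";
```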

Shang Zhang · Answer 6 · 2017-09-14T20:32:32.980

Thanks to everyone who answered. I am very inexperienced in using Perl to grab data from the Internet (SEC Edgar filings in this particular case), so I am probably not even asking the most intelligent questions.

The business problem (per my best understanding):

1) When a company files its 10K/Q using XBRL, the SEC wants the trading symbol information disclosed based on one of the SEC's published schemas.

2) The complete list of schema locations is known (and will grow):

-- http://taxonomies.xbrl.us/us-gaap/2009/non-gaap/dei-2009-01-31.xsd
-- https://xbrl.sec.gov/dei/2012/dei-2012-01-31.xsd
-- https://xbrl.sec.gov/dei/2013/dei-2013-01-31.xsd
-- https://xbrl.sec.gov/dei/2014/dei-2014-01-31.xsd

3) I want to grab such trading symbol information.

I now understand that the "dei" namespace-prefix has no real significance. But it seems that even the namespace-name itself e.g. 'http://xbrl.sec.gov/dei/2012-01-31' has no significance. Only the schema location is truly meaningful. Is this correct?

My understanding is that the XBRL instance document references a schema document which "maps" the namespace (e.g. http://xbrl.sec.gov/dei/2012-01-31) to the schema location. (So the namespace-name only needs to be a unique string.)

So is there a way to modify ikegami's code to use the schema locations instead of the namespace names?

Example of a complete XBRL filing: https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664
