The XML file contains this snippet:

<?xml version="1.0"?>
<PC-AssayContainer
    xmlns="http://www.ncbi.nlm.nih.gov"
    xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
    xs:schemaLocation="http://www.ncbi.nlm.nih.gov ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd"
>
....
    <PC-AnnotatedXRef>
      <PC-AnnotatedXRef_xref>
        <PC-XRefData>
          <PC-XRefData_pmid>17959251</PC-XRefData_pmid>
        </PC-XRefData>
      </PC-AnnotatedXRef_xref>
    </PC-AnnotatedXRef>

I tried to query it with XPath's descendant search (//), both without a namespace and with one:

library('XML')
doc = xmlInternalTreeParse('http://s3.amazonaws.com/tommy_chheng/pubmed/485270.descr.xml')
> xpathApply(doc, "//PC-XRefData_pmid")
list()
attr(,"class")
[1] "XMLNodeSet"
> getNodeSet(doc, "//PC-XRefData_pmid")
list()
attr(,"class")
[1] "XMLNodeSet"
> xpathApply(doc, "//xs:PC-XRefData_pmid", ns="xs")
list()
> xpathApply(doc, "//xs:PC-XRefData_pmid", ns= c(xs = "http://www.w3.org/2001/XMLSchema-instance"))
list()

Shouldn't the XPath match this element?

<PC-XRefData_pmid>17959251</PC-XRefData_pmid>
Ben
tommy chheng
  • Not knowing anything about R, I'm assuming the `ns="xs"` and `ns= c(xs...` parts are declaring the namespaces used in your expression. That's likely the problem, since the element `PC-XRefData_pmid` is not a member of the `http://www.w3.org/2001/XMLSchema-instance` namespace but rather `http://www.ncbi.nlm.nih.gov`, which is the default namespace in the source document. Searching for `xs:PC-XRefData_pmid` is wrong. – Welbog Oct 06 '10 at 20:41
  • I assume I wouldn't need a namespace because the default one is xmlns="http://www.ncbi.nlm.nih.gov"? Shouldn't the XPath query "//PC-XRefData_pmid" work? – tommy chheng Oct 06 '10 at 21:11

2 Answers

Since the default namespace is the NIH one (whose URI is "http://www.ncbi.nlm.nih.gov"), <PC-XRefData_pmid> (and every other element in your XML document that has no namespace prefix) is in that NIH namespace.

So to match them with an XPath, you need to tell your XPath processor what prefix you're going to use for the NIH namespace, and you need to use that prefix in your XPath.

So, without knowing R, I would try

xpathApply(doc, "//nih:PC-XRefData_pmid",
   ns= c(nih = "http://www.ncbi.nlm.nih.gov"))

or else

getNodeSet(doc, "//*[local-name() = 'PC-XRefData_pmid']")

as the latter bypasses namespaces.

Just because the XML document declares the NIH namespace as the default one doesn't mean that the XPath processor will know that. In the XML information model, namespace prefixes are not significant. So when I parse in an XML document, it's not supposed to matter whether the NIH namespace is bound to the "nih:" prefix or the "snizzlefritz:" prefix or the "" (default) prefix. The XML parser or XPath processor is not supposed to have to know what prefix got bound to what namespace in the XML document. Especially since there could be several different prefixes bound to the same namespace at different places in the same document... and vice versa. So if you want to have your XPath expression match an element that's in a namespace, you have to declare that namespace to the XPath processor.

Edit: There are a few caveats, contributed by @Jim Pivarski:

  • The "doc" must be an xml node, not a document (class "XMLNode" or "XMLInternalElementNode", not "XMLDocument" or "XMLInternalDocument").
  • At least in Jim's version (XML_3.93-0), the named argument is "namespaces", not "ns".

So if "doc" is an instance of a document class, the correct solution is:

xpathApply(xmlRoot(doc), "//nih:PC-XRefData_pmid",
   namespaces = c(nih = "http://www.ncbi.nlm.nih.gov"))
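As a rough illustration of the points above, here is a self-contained sketch that parses an inline copy of the snippet instead of fetching the file. The inline XML string and the variable names (`xml_text`, `nih_uri`, `with_nih`, and so on) are made up for the example, and it assumes a version of the XML package that provides `xmlParse` and the `namespaces` argument. It shows that the prefix is arbitrary as long as it is bound to the NIH URI, and that `local-name()` sidesteps namespaces entirely.

```r
library(XML)

# Inline stand-in for the PubChem file, trimmed to the relevant elements
# (hypothetical test data, not the full document).
xml_text <- '<?xml version="1.0"?>
<PC-AssayContainer xmlns="http://www.ncbi.nlm.nih.gov">
  <PC-AnnotatedXRef>
    <PC-AnnotatedXRef_xref>
      <PC-XRefData>
        <PC-XRefData_pmid>17959251</PC-XRefData_pmid>
      </PC-XRefData>
    </PC-AnnotatedXRef_xref>
  </PC-AnnotatedXRef>
</PC-AssayContainer>'

doc  <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)  # query from a node, following the caveat above
nih_uri <- "http://www.ncbi.nlm.nih.gov"

# The prefix in the XPath only has to be bound to the right URI.
with_nih <- xpathSApply(root, "//nih:PC-XRefData_pmid", xmlValue,
                        namespaces = c(nih = nih_uri))

# A completely different prefix bound to the same URI matches the same node.
with_other <- xpathSApply(root, "//snizzlefritz:PC-XRefData_pmid", xmlValue,
                          namespaces = c(snizzlefritz = nih_uri))

# local-name() compares only the local part, ignoring the namespace.
by_local_name <- xpathSApply(root, "//*[local-name() = 'PC-XRefData_pmid']",
                             xmlValue)

print(with_nih)
print(with_other)
print(by_local_name)
```

All three queries should return the same PMID. The `local-name()` form is convenient, but it would also match a same-named element from any other namespace, so the prefixed form is the stricter choice.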
LarsH
  • This is terrific, I've always wondered how to bypass namespaces because they're always giving me headaches. – Roman Luštrik Jul 24 '11 at 19:25
  • @WarrenFaith: I'm glad you brought in these extra caveats, that I didn't know about. I wonder though, if they'd fit better in a separate answer. As it is, it sounds (especially the last couple of sentences) like it's coming from me. – LarsH Oct 10 '12 at 01:59
  • @LarsH I didn't but I modified it as the last one (removed some unimportant stuff). – WarrenFaith Oct 10 '12 at 08:26
  • WF: Oh, I see, you just reviewed Jim's edit - and removed his self-attribution? I guess I'll edit it to make clear what he added. @JimPivarski: I think your caveats are valuable, but next time please put them in a comment or a separate answer. For this time, I'll incorporate them into my answer, with attribution. – LarsH Oct 10 '12 at 19:17

This is a FAQ.

This: //PC-XRefData_pmid

Means: any PC-XRefData_pmid in the document that is in no namespace (an empty namespace).

It does not mean: any PC-XRefData_pmid in the document that is in the default namespace.

Also, your document sample isn't complete, but it looks like your PC-XRefData_pmid element is in the http://www.ncbi.nlm.nih.gov namespace.
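To make the distinction concrete, here is a small sketch with a made-up document (the `root` and `wrapper` elements and the values 1 and 2 are hypothetical, chosen only for illustration). It contains one PC-XRefData_pmid in no namespace and one inside the NIH default namespace, and assumes the XML package's `xmlParse` and `xpathSApply` with the `namespaces` argument.

```r
library(XML)

# Hypothetical document: the first PC-XRefData_pmid is in no namespace,
# the second inherits the default namespace declared on <wrapper>.
xml_text <- '<root>
  <PC-XRefData_pmid>1</PC-XRefData_pmid>
  <wrapper xmlns="http://www.ncbi.nlm.nih.gov">
    <PC-XRefData_pmid>2</PC-XRefData_pmid>
  </wrapper>
</root>'

doc <- xmlParse(xml_text, asText = TRUE)

# Unprefixed name test: matches only the element with a null namespace URI.
no_ns <- xpathSApply(doc, "//PC-XRefData_pmid", xmlValue)

# Prefixed name test: matches only the element in the NIH namespace.
in_ns <- xpathSApply(doc, "//n:PC-XRefData_pmid", xmlValue,
                     namespaces = c(n = "http://www.ncbi.nlm.nih.gov"))

print(no_ns)
print(in_ns)
```

The unprefixed query picks up only the first element, which is why `//PC-XRefData_pmid` returns an empty node set on a document where every element sits in a default namespace.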

  • @Alejandro, can you provide a reference for the part in bold? I believe you, but want to know for sure that this is not just true of XPath in XSLT but of XPath in general, even when a default namespace declaration is passed to the XPath processor. – LarsH Oct 06 '10 at 21:56
  • @Alejandro: never mind, I see it at http://www.w3.org/TR/xpath/#node-tests. Would you say this is true of XPath 1.0 but not true of 2.0? since XSLT 2.0 lets you declare a default ns for XPath expressions. – LarsH Oct 06 '10 at 21:58
  • Thanks, I didn't know that info about // for xpath queries. – tommy chheng Oct 06 '10 at 22:02
  • @Alejandro: answering myself again. :-) According to http://www.w3.org/TR/xpath20/#node-tests, in XPath 2.0, "An **unprefixed QName**, when used as a name test on an axis whose principal node kind is element, has the namespace URI of the **default** element/type **namespace** in the expression context; otherwise, it has no namespace URI." But we assume @tommy is using XPath 1.0. – LarsH Oct 06 '10 at 22:06
  • @LarsH: I think this is the proper part of [specs](http://stackoverflow.com/questions/3876571/how-can-i-use-xpath-querying-using-rs-xml-library): `Two expanded-names are equal if they have the same local part, and either both have a null namespace URI or both have non-null namespace URIs that are equal.` Also, it looks like the proper term should be **null namespace URI** –  Oct 06 '10 at 22:07
  • Is xpath version tied to the xml version? at the top of the file, i have – tommy chheng Oct 06 '10 at 22:09
  • @LarsH: There is a difference between XPath 1.0 and XPath 2.0: in 2.0 the default namespace is part of the evaluation context (I don't remember which one, static or dynamic); that's why you can use the `xpath-default-namespace` attribute in XSLT 2.0 –  Oct 06 '10 at 22:13
  • @tommy chheng: You wrote: `Is xpath version tied to the xml version?`. No. It depends on your XPath engine. –  Oct 06 '10 at 22:15