I'm parsing a Swedish library catalogue using R and the XML package. Using the library's API, I get XML back from a URL containing my query.

I'd like to use XPath queries to parse each record, but every XPath query I run with the XML package returns an empty list, except "//*". I'm no expert in either XML parsing or XPath, but I suspect that it has to do with the XML that the API returns.

This is a simple example of one single post in the catalogue:

library(XML)

example.url <- "http://libris.kb.se/sru/swepub?version=1.1&operation=searchRetrieve&query=mat:dok&maximumRecords=1&recordSchema=mods"
doc = xmlParse(example.url)

# Title
works <- xmlRoot(doc)[[4]][["record"]][["recordData"]][["mods"]][["titleInfo"]][["title"]][[1]]
doesntwork <- getNodeSet(doc, "//title")

# The only xPath that returns anything
onlythisworks <- getNodeSet(doc, "//*")

If this has something to do with namespaces (as these answers suggest), all I understand about it is that the API returns data that seems to have namespaces defined in the initial tag, and that I could use that, but this doesn't help me:

# Namespaces are confusing:
title <- getNodeSet(xmlRoot(doc), "//xsi:title", namespaces = c(xsi = "http://www.w3.org/2001/XMLSchema-instance"))

Here's (again) the example return data that I'm trying to parse.

nJGL

1 Answer

You have to use the right namespace. Try the following:

doesntwork <- getNodeSet(doc, "//mods:title")
#[[1]]
#<title>Horizontal Slot Waveguides for Silicon Photonics Back-End Integration [Elektronisk resurs]</title> 
#
#[[2]]
#<title>TRITA-ICT/MAP AVH, 2014:17                      \
#                           </title> 
#
#attr(,"class")
#[1] "XMLNodeSet"

BTW: I usually get the namespaces via

nsDefs <- xmlNamespaceDefinitions(doc, simplify = TRUE, recursive = TRUE)

But this throws an error in your case. It complains that there are different URIs for the same namespace prefix. According to this site, that does not seem to be good coding style.
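As a small offline illustration of that lookup, here is a sketch on a made-up miniature document. The `xml.text` string is my own stand-in that only mimics the namespace layout of the response; it is not actual API output. It reads only the declarations on the root tag, without `recursive`:

```r
library(XML)

# Hypothetical stand-in for the API response: same namespace layout, invented content
xml.text <- '<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/"
    xmlns:mods="http://www.loc.gov/mods/v3">
  <numberOfRecords>1</numberOfRecords>
  <records><record><recordData>
    <mods xmlns="http://www.loc.gov/mods/v3">
      <titleInfo><title>An example title</title></titleInfo>
    </mods>
  </recordData></record></records>
</searchRetrieveResponse>'
mini <- xmlParse(xml.text, asText = TRUE)

# Namespace declarations found on the root element only (no recursion)
nsDefs <- xmlNamespaceDefinitions(xmlRoot(mini), simplify = TRUE)
print(nsDefs)
```

The result should be a character vector of URIs named by prefix, which can be passed on as the `namespaces` argument of `getNodeSet`.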


Update as per OP's comment

I am not an XML expert myself, but here is my take. You can define a default namespace via <tag xmlns=URI>. Non-default namespaces are of the form <tag xmlns:a=URI>, with a being the namespace prefix.

Your document declares two different default namespaces. The first is in <searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/" ... >, the second in <mods xmlns="http://www.loc.gov/mods/v3" ... >. The second URI also appears in the first tag as xmlns:mods="http://www.loc.gov/mods/v3" (where it is non-default). This seems rather messy, although it is legal XML.

Now, the <title> tag is within the <mods> tag, so it inherits the default namespace declared there. XPath matches namespaces by URI, not by prefix, so although <mods> and the tags below it (like <title>) carry no prefix in the document, they can be addressed with any prefix bound to http://www.loc.gov/mods/v3. The mods prefix works in the query above because the root tag binds it to that same URI and, as far as I can tell, getNodeSet uses the namespace definitions of the root node by default.

This does not apply to the <numberOfRecords> tag, because it is outside of <mods> and therefore belongs to the first default namespace. You can access this node via

getNodeSet(doc, "//ns:numberOfRecords",
       namespaces = c(ns="http://www.loc.gov/zing/srw/"))

Here you take the default namespace URI declared in <searchRetrieveResponse> and bind it to a prefix (ns in our case). You can then use that prefix explicitly in your XPath query.
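The same idea extends to both namespaces at once. Below is a minimal sketch on an invented two-record document; the XML string, the titles, and the prefix names `srw` and `mods` are my own assumptions for illustration, and only the two namespace URIs come from the actual response:

```r
library(XML)

# Hypothetical miniature response with two records and invented titles
xml.text <- '<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <numberOfRecords>2</numberOfRecords>
  <records>
    <record><recordData>
      <mods xmlns="http://www.loc.gov/mods/v3">
        <titleInfo><title>First example title</title></titleInfo>
      </mods>
    </recordData></record>
    <record><recordData>
      <mods xmlns="http://www.loc.gov/mods/v3">
        <titleInfo><title>Second example title</title></titleInfo>
      </mods>
    </recordData></record>
  </records>
</searchRetrieveResponse>'
mini <- xmlParse(xml.text, asText = TRUE)

# Bind a prefix of our own choosing to each namespace URI
ns <- c(srw  = "http://www.loc.gov/zing/srw/",
        mods = "http://www.loc.gov/mods/v3")

# Element in the outer default namespace
n.records <- xpathSApply(mini, "//srw:numberOfRecords", xmlValue, namespaces = ns)

# Elements in the inner default namespace, reached through the outer one
titles <- xpathSApply(mini, "//srw:record//mods:title", xmlValue, namespaces = ns)

# Title of the second record only
second.title <- xpathSApply(mini, "//srw:record[2]//mods:title", xmlValue,
                            namespaces = ns)
```

Passing the prefix-to-URI vector explicitly means the query does not depend on which prefixes the document itself happens to declare.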

cryo111
  • I need to make time to learn about this namespace business. Good link to start out with. Thank you @cryo111 . xmlNamespaceDefinitions() does list the namespaces in the above example like this: `nsDefs=xmlNamespaceDefinitions(doc,simplify = TRUE)`, without recursive. – nJGL Jan 13 '16 at 17:43
  • How should one know that it's the mods-namespace that I should use? Now I can only xPath to tags with the mods:-namespace, in spite of trying all other declared namespaces, I can't figure out how to make an xPath to or that are not nested within the -tag... – nJGL Jan 14 '16 at 13:47
  • Wow. Thanks a million. Basically, I have to work around my xml-response being messy. I found [this link](https://msdn.microsoft.com/en-us/library/bb986013.aspx) instructive for general namespace-stuff. But when I want to loop through books I'll have to skip between namespaces. Building on your update above @cryo111, I got this line working with multiple namespaces: `getNodeSet(doc, "//ns:record[2]//mods:title", namespaces = c(ns="http://www.loc.gov/zing/srw/", mods="http://www.loc.gov/mods/v3"))`, and can go ahead and scrap the data that I want. All is now well. Thanks again! – nJGL Jan 14 '16 at 16:12
  • @nJGL You are welcome. Good link on namespaces that you have provided. – cryo111 Jan 14 '16 at 16:49