1

I would like to add an attribute to the very top node of an xml file and then save the file. I've tried every combination of xpath and subsetting I can think of, but just can't seem to make it work. To use a simple example:

xml_string = c(
 '<?xml version="1.0" encoding="UTF-8"?>',
 '<retrieval-response status = "found">',
      '<coredata>',
           '<id type = "author" >12345</id>',
      '</coredata>',
      '<author>',
           '<first>John</first>',
           '<last>Doe</last>',
      '</author>',
 '</retrieval-response>')

# parse xml content
xml = xmlParse(xml_string)

When I try

xmlAttrs(xml["/retrieval-response"][[1]]) <- c(id = 12345)

I get an error:

object of type 'externalptr' is not subsettable

However, the attribute is inserted, so I'm not sure what I'm doing wrong.

(more background: this is a simplified version of the data from Scopus's API. I am combining thousands of xml files structured similarly, but the id is in the "coredata" node which is a sibling to the "author" node which contains all of the data, so when I use SAS to compile the combined XML document into datasets there is no link between the id and the data. I'm hoping that adding the id to the top of the hierarchy will cause it to propagate down to all of the other levels).

Sarah Hailey
  • 494
  • 5
  • 19
  • This can easily be done with [XSLT](http://www.w3schools.com/xsl/) the language that re-structures XML documents to any nuanced need. And if by [SAS](https://www.sas.com/en_us/home.html), you mean the statistical package, then we can use its [proc xsl](http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a003356144.htm). Please tag this with SAS and provide actual sample of XML document and desired dataset result. – Parfait Nov 10 '15 at 18:40
  • [Here](https://dl.dropboxusercontent.com/u/8428744/example_file.xml)'s an example file. I have over 11,000 files like this and I used a program called mergex.exe to merge them into one big XML file. Then I used SAS's XML mapper to import the XML file into SAS. Very handy, but the structure of the XML file makes it impossible to link the id to the author info. Ideally, I'd have every dataset produced in SAS contain the author id (which I pull from the XML file using `as.numeric(sub("AUTHOR_ID:", "", xmlValue(xml["//dc:identifier"][[1]])))` – Sarah Hailey Nov 10 '15 at 21:46

2 Answers2

2

In order to migrate XML data into two-dimensions of rows and columns according to the structure of datasets and dataframes, all nests must be removed to only iterating parent and one child level. Therefore, XSLT, the special-purpose declarative programming language that restructures XML documents to any nuanced needs, comes in handy to restructure XML data for end use needs.

Given your example XML, below is an XSLT that can be run and resulting XML be successfully imported into SAS. Have the SAS code looped to restructure all thousands of XML files.

XSLT (save as .xsl or .xslt format)

 <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
       xmlns:ait="http://www.elsevier.com/xml/ani/ait"
       xmlns:ce="http://www.elsevier.com/xml/ani/common"
       xmlns:cto="http://www.elsevier.com/xml/cto/dtd"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:ns1="http://webservices.elsevier.com/schemas/search/fast/types/v4"
       xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
       xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
       xmlns:xoe="http://www.elsevier.com/xml/xoe/dtd"
       exclude-result-prefixes="ait ce cto dc ns1 prism xocs xoe">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />

 <xsl:template match="author-retrieval-response">
  <xsl:variable select="substring-after(coredata/dc:identifier, ':')" name="authorid"/>
  <root>
      <coredata>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="coredata/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="concat(.,@href)"/>
          </xsl:element>
        </xsl:for-each>
      </coredata>

      <subjectAreas>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="subject-areas/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </subjectAreas>

      <authorname>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/preferred-name/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </authorname>

      <classifications>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/classificationgroup/classifications/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </classifications>

      <journals>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/journal-history/journal/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </journals>

      <ipdoc>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/*[not(local-name()='address')]">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </ipdoc>

      <address>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/address/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </address>  
  </root>
 </xsl:template>

</xsl:transform>

SAS (using above script)

proc xsl 
    in="C:\Path\To\Original.xml"
    out="C:\Path\To\Output.xml"
    xsl="C:\Path\To\XSLT.xsl";
run;

** STORING XML CONTENT;
libname temp xml 'C:\Path\To\Output.xml'; 

** APPEND CONTENT TO SAS DATASETS;
data Work.Coredata; 
    retain authorid;
    set temp.Coredata;  ** NAME OF PARENT NODE IN XML;
run;

data Work.SubjectAreas; 
    retain authorid;
    set temp.SubjectAreas;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Authorname;   
    retain authorid;
    set temp.Authorname;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Classifications;
    retain authorid;
    set temp.Classifications;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Journals; 
    retain authorid;
    set temp.Journals;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Ipdoc;    
    retain authorid;
    set temp.Ipdoc;  ** NAME OF PARENT NODE IN XML;
run;

XML OUTPUT (which is imported as Authorsdata dataset of one row and 40 variables)

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <coredata>
      <authorid>1234567</authorid>
      <url>http://api.elsevier.com/content/author/author_id/1234567</url>
      <identifier>AUTHOR_ID:1234567</identifier>
      <eid>9-s2.0-1234567</eid>
      <document-count>3</document-count>
      <cited-by-count>95</cited-by-count>
      <citation-count>97</citation-count>
      <link>http://api.elsevier.com/content/search/scopus?query=refauid%1234567%29</link>
      <link>http://www.scopus.com/authid/detail.url?partnerID=HzOxMe3b&amp;authorId=1234567&amp;origin=inward</link>
      <link>http://api.elsevier.com/content/author/author_id/1234567</link>
      <link>http://api.elsevier.com/content/search/scopus?query=au-id%281234567%29</link>
   </coredata>
   <subjectAreas>
      <authorid>1234567</authorid>
      <subject-area>Human-Computer Interaction</subject-area>
      <subject-area>Control and Systems Engineering</subject-area>
      <subject-area>Software</subject-area>
      <subject-area>Computer Vision and Pattern Recognition</subject-area>
      <subject-area>Artificial Intelligence</subject-area>
   </subjectAreas>
   <authorname>
      <authorid>1234567</authorid>
      <initials>A.</initials>
      <indexed-name>John A.</indexed-name>
      <surname>John</surname>
      <given-name>Doe</given-name>
   </authorname>
   <classifications>
      <authorid>1234567</authorid>
      <classification>1709</classification>
      <classification>2207</classification>
      <classification>1712</classification>
      <classification>1707</classification>
      <classification>1702</classification>
   </classifications>
   <journals>
      <authorid>1234567</authorid>
      <sourcetitle>Very Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>V PRES JOU Autom</sourcetitle-abbrev>
      <issn>10504729</issn>
      <sourcetitle>2005 Another Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>An. Prest. Jou. </sourcetitle-abbrev>
   </journals>
   <ipdoc>
      <authorid>1234567</authorid>
      <afnameid>Prestigious University#1111111</afnameid>
      <afdispname>Prestigious University University</afdispname>
      <preferred-name>Prestigious University University</preferred-name>
      <sort-name>Prestigious University</sort-name>
      <org-domain>pu.edu</org-domain>
      <org-URL>http://www.pu.edu/index.shtml</org-URL>
   </ipdoc>
   <address>
      <authorid>1234567</authorid>
      <address-part>1234 Prestigious Lane</address-part>
      <city>City</city>
      <state>ST</state>
      <postal-code>12345</postal-code>
      <country>United States</country>
   </address>
</root>

R ALTERNATIVE

Since no comprehensive R XSLT library exists, parsing will have to be done directly in R language. However, R can call XSLT processors of other executables (i.e. Python, Saxon, VBA) through command line, RCOMClient package, and other interfaces.

Nonetheless, R can extract XML data by xmlToDataFrame() and xpathSApply() (the latter being similar to XPath) for the authorid:

library(XML)

coredata <- xmlToDataFrame(nodes = getNodeSet(doc, '//coredata'))
coredata$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                          xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

subjectareas <- xmlToDataFrame(nodes = getNodeSet(doc, "//subject-areas"))
subjectareas$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                              xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

authorname <-  xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/preferred-name'))
authorname$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                            xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

classifications <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/classificationgroup/classifications'))
classifications$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                                 xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

journal <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/journal-history/journal'))
journal$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                         xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

ipdoc <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc'))
ipdoc$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                       xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

address <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc/address'))
address$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                         xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • What kind of witchcraft... ! This is an amazing answer, thank you for being so thorough. Actually, the goal is not to have one big dataset with all of the info in one, but to have the relational datasets with each set of information separate but a unique identifier in each. I will read over this carefully and learn everything I can, and unless you have other thoughts I will mark this as an answer soon. – Sarah Hailey Nov 12 '15 at 15:56
  • 1
    See update. XSLT can use [variables](http://www.w3schools.com/xsl/el_variable.asp) which can be passed into other parts of the document even parse out `Author:` with its [substring-after](http://zvon.org/xxl/XSLTreference/OutputOverview/function_substring-after_frame.html) function. So, `authorid` can be passed into the other related nodes. In fact, I just learned myself here that SAS can import multiple tables from one XML! Certainly will add this example to my library. As for R, simply use `xmltodataframe` for node sets and `xmlSApply()` for authorids. Thanks for question! – Parfait Nov 13 '15 at 00:49
1

Edit: After trying the approach of editing the top node (see Old Answer below), I realized that editing the top node doesn't solve my problem because the SAS XML mapper did not retain all of the ids.

I tried a new approach of adding the author id to each of the subnodes which worked perfectly. I also learned that you can use XPath to select multiple nodes by putting them into a vector, like this:

c("//coredata",
  "//affiliation-current",
  "affiliation-history",
  "subject-areas",
  "//author-profile")

So the final program I used was:

files <- list.files()

for (i in 1:length(files)) {
     author_record <- xmlParse(files[i])

     xpathApply(
          author_record, c(
               "//coredata",
               "//affiliation-current",
               "affiliation-history",
               "subject-areas",
               "//author-profile"
          ),
          addAttributes,
          auth_id = gsub("AUTHOR_ID:", "", xmlValue(author_record[["//dc:identifier"]]))
     )

     saveXML(author_record, file = files[i])
}

Old Answer: After much experimentation I found a rather simple solution to my problem.

Attributes can be added to the top node by simply using

addAttributes(xmlRoot(xmlfile), attribute = "attributeValue") 

For my specific case, the most straightforward solution will be a simple loop:

setwd("C:/directory/with/individual/xmlfiles")

files <- list.files()

for (i in 1:length(files)) {

 author_record <- xmlParse(files[i])

 addAttributes(node = xmlRoot(author_record), 
               id   = gsub   (pattern = "AUTHOR_ID:", 
                              replacement = "", 
                              x = xmlValue(auth[["//dc:identifier"]])
               )
 )

  saveXML(author_record, file = files[i])
}

I'm sure there are better ways. Clearly I need to learn XLST, that was a very powerful approach!

Sarah Hailey
  • 494
  • 5
  • 19