To migrate XML data into the two-dimensional rows and columns of datasets and dataframes, all nesting must be removed so that only a parent level and one child level remain. XSLT, the special-purpose declarative language for restructuring XML documents, is well suited to reshaping XML for such end uses.
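To illustrate what "a parent level and one child level" means, here is a minimal Python sketch using only the standard library. The miniature document and the names `leaf_children` and `split_sections` are hypothetical and exist only for illustration; this is not the transformation used in this answer, just the same idea on a toy input.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: <profile> nests a <name> group one level too deep.
NESTED = """
<response>
  <coredata><id>1234567</id><count>3</count></coredata>
  <profile>
    <status>active</status>
    <name><surname>Doe</surname><initials>J.</initials></name>
  </profile>
</response>
"""

def leaf_children(parent):
    """Return {tag: text} for direct children that have no children of their own."""
    return {child.tag: (child.text or "").strip()
            for child in parent if len(child) == 0}

def split_sections(root, section_paths, key_path):
    """Build one flat record per section, each carrying the shared key."""
    key = root.findtext(key_path)
    tables = {}
    for name, path in section_paths.items():
        record = {"authorid": key}
        record.update(leaf_children(root.find(path)))
        tables[name] = record
    return tables

root = ET.fromstring(NESTED)
tables = split_sections(
    root,
    {"coredata": "coredata", "profile": "profile", "name": "profile/name"},
    key_path="coredata/id",
)
for name, record in tables.items():
    print(name, record)
```

Each nested group becomes its own flat record, and the repeated `authorid` is what lets the records be joined again later. Note that a dictionary keeps only the last value when a tag repeats, so repeated children would need a list per tag instead.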
Given your example XML, below is an XSLT script that can be run so that the resulting XML imports successfully into SAS. To restructure all of your thousands of XML files, wrap the SAS code in a loop.
XSLT (save as a .xsl or .xslt file)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:ait="http://www.elsevier.com/xml/ani/ait"
               xmlns:ce="http://www.elsevier.com/xml/ani/common"
               xmlns:cto="http://www.elsevier.com/xml/cto/dtd"
               xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:ns1="http://webservices.elsevier.com/schemas/search/fast/types/v4"
               xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
               xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
               xmlns:xoe="http://www.elsevier.com/xml/xoe/dtd"
               exclude-result-prefixes="ait ce cto dc ns1 prism xocs xoe">
  <xsl:output version="1.0" encoding="UTF-8" indent="yes" />

  <xsl:template match="author-retrieval-response">
    <!-- Author ID is the text after ':' in dc:identifier; it is repeated in every group as a key -->
    <xsl:variable name="authorid" select="substring-after(coredata/dc:identifier, ':')"/>
    <root>
      <coredata>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="coredata/*">
          <!-- local-name() drops the namespace prefix; @href supplies the value of link elements -->
          <xsl:element name="{local-name()}">
            <xsl:value-of select="concat(.,@href)"/>
          </xsl:element>
        </xsl:for-each>
      </coredata>
      <subjectAreas>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="subject-areas/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </subjectAreas>
      <authorname>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/preferred-name/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </authorname>
      <classifications>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/classificationgroup/classifications/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </classifications>
      <journals>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/journal-history/journal/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </journals>
      <ipdoc>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <!-- The nested address group is excluded here and written to its own group below -->
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/*[not(local-name()='address')]">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </ipdoc>
      <address>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/address/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </address>
    </root>
  </xsl:template>
</xsl:transform>
SAS (using above script)
proc xsl
in="C:\Path\To\Original.xml"
out="C:\Path\To\Output.xml"
xsl="C:\Path\To\XSLT.xsl";
run;
** STORING XML CONTENT;
libname temp xml 'C:\Path\To\Output.xml';
** APPEND CONTENT TO SAS DATASETS;
data Work.Coredata;
retain authorid;
set temp.Coredata; ** NAME OF PARENT NODE IN XML;
run;
data Work.SubjectAreas;
retain authorid;
set temp.SubjectAreas; ** NAME OF PARENT NODE IN XML;
run;
data Work.Authorname;
retain authorid;
set temp.Authorname; ** NAME OF PARENT NODE IN XML;
run;
data Work.Classifications;
retain authorid;
set temp.Classifications; ** NAME OF PARENT NODE IN XML;
run;
data Work.Journals;
retain authorid;
set temp.Journals; ** NAME OF PARENT NODE IN XML;
run;
data Work.Ipdoc;
retain authorid;
set temp.Ipdoc; ** NAME OF PARENT NODE IN XML;
run;
data Work.Address;
retain authorid;
set temp.Address; ** NAME OF PARENT NODE IN XML;
run;
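The looping step over thousands of files can also be done outside SAS before the import. The following is a minimal Python sketch of that idea, not part of the solution above. The helper names `batch_transform` and `make_lxml_transform` are hypothetical, and the third-party lxml package is an assumed dependency that is only needed if `make_lxml_transform` is actually called.

```python
from pathlib import Path

def make_lxml_transform(xsl_path):
    """Compile the stylesheet once (assumes the third-party lxml package is installed)."""
    from lxml import etree  # lazy import: the loop below has no hard dependency on lxml
    xslt = etree.XSLT(etree.parse(str(xsl_path)))

    def transform(src, dst):
        # Apply the compiled stylesheet to one file and write the flattened XML
        result = xslt(etree.parse(str(src)))
        dst.write_bytes(etree.tostring(result, encoding="UTF-8",
                                       xml_declaration=True, pretty_print=True))
    return transform

def batch_transform(in_dir, out_dir, transform, pattern="*.xml"):
    """Call transform(src, dst) for every matching file; return the output paths."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob(pattern)):
        dst = out_dir / src.name
        transform(src, dst)
        written.append(dst)
    return written

# Example call with placeholder paths (directory names are assumptions):
# batch_transform(r"C:\Path\To\Originals", r"C:\Path\To\Outputs",
#                 make_lxml_transform(r"C:\Path\To\XSLT.xsl"))
```

Passing the transform in as a function keeps the loop independent of the XSLT processor, so the same loop could instead shell out to Saxon or another tool.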
XML OUTPUT (each child of root is imported as its own SAS dataset, keyed by authorid)
<?xml version="1.0" encoding="UTF-8"?>
<root>
<coredata>
<authorid>1234567</authorid>
<url>http://api.elsevier.com/content/author/author_id/1234567</url>
<identifier>AUTHOR_ID:1234567</identifier>
<eid>9-s2.0-1234567</eid>
<document-count>3</document-count>
<cited-by-count>95</cited-by-count>
<citation-count>97</citation-count>
<link>http://api.elsevier.com/content/search/scopus?query=refauid%1234567%29</link>
<link>http://www.scopus.com/authid/detail.url?partnerID=HzOxMe3b&amp;authorId=1234567&amp;origin=inward</link>
<link>http://api.elsevier.com/content/author/author_id/1234567</link>
<link>http://api.elsevier.com/content/search/scopus?query=au-id%281234567%29</link>
</coredata>
<subjectAreas>
<authorid>1234567</authorid>
<subject-area>Human-Computer Interaction</subject-area>
<subject-area>Control and Systems Engineering</subject-area>
<subject-area>Software</subject-area>
<subject-area>Computer Vision and Pattern Recognition</subject-area>
<subject-area>Artificial Intelligence</subject-area>
</subjectAreas>
<authorname>
<authorid>1234567</authorid>
<initials>A.</initials>
<indexed-name>John A.</indexed-name>
<surname>John</surname>
<given-name>Doe</given-name>
</authorname>
<classifications>
<authorid>1234567</authorid>
<classification>1709</classification>
<classification>2207</classification>
<classification>1712</classification>
<classification>1707</classification>
<classification>1702</classification>
</classifications>
<journals>
<authorid>1234567</authorid>
<sourcetitle>Very Prestigious Journal</sourcetitle>
<sourcetitle-abbrev>V PRES JOU Autom</sourcetitle-abbrev>
<issn>10504729</issn>
<sourcetitle>2005 Another Prestigious Journal</sourcetitle>
<sourcetitle-abbrev>An. Prest. Jou. </sourcetitle-abbrev>
</journals>
<ipdoc>
<authorid>1234567</authorid>
<afnameid>Prestigious University#1111111</afnameid>
<afdispname>Prestigious University University</afdispname>
<preferred-name>Prestigious University University</preferred-name>
<sort-name>Prestigious University</sort-name>
<org-domain>pu.edu</org-domain>
<org-URL>http://www.pu.edu/index.shtml</org-URL>
</ipdoc>
<address>
<authorid>1234567</authorid>
<address-part>1234 Prestigious Lane</address-part>
<city>City</city>
<state>ST</state>
<postal-code>12345</postal-code>
<country>United States</country>
</address>
</root>
R ALTERNATIVE
If no XSLT library is available in your R installation, parsing has to be done directly in R. R can, however, call external XSLT processors (e.g., Python, Saxon, VBA) through the command line, the RDCOMClient package, and other interfaces.
Nonetheless, R can extract the XML data with xmlToDataFrame() and retrieve the authorid with xpathSApply(), which evaluates an XPath expression:
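library(XML)
# Parse the source file once (placeholder path); doc is used by every call below
doc <- xmlParse("C:/Path/To/Original.xml")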
coredata <- xmlToDataFrame(nodes = getNodeSet(doc, '//coredata'))
coredata$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
subjectareas <- xmlToDataFrame(nodes = getNodeSet(doc, "//subject-areas"))
subjectareas$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
authorname <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/preferred-name'))
authorname$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
classifications <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/classificationgroup/classifications'))
classifications$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
journal <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/journal-history/journal'))
journal$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
ipdoc <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc'))
ipdoc$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
address <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc/address'))
address$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
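The same direct-parsing approach can be sketched in Python with the standard library, mainly to show the namespace lookup for dc:identifier and the author ID being computed once and reused. The trimmed sample document and the helper names `author_id` and `section_row` are hypothetical; only the namespace URI and element paths are taken from the stylesheet above.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the stylesheet's xmlns:dc declaration
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical, heavily trimmed response; real files contain many more elements
SAMPLE = """
<author-retrieval-response xmlns:dc="http://purl.org/dc/elements/1.1/">
  <coredata>
    <dc:identifier>AUTHOR_ID:1234567</dc:identifier>
    <document-count>3</document-count>
  </coredata>
  <author-profile>
    <preferred-name><initials>A.</initials><surname>John</surname></preferred-name>
  </author-profile>
</author-retrieval-response>
"""

def author_id(root):
    """Return the text after ':' in dc:identifier (like substring-after in XSLT)."""
    identifier = root.findtext("coredata/dc:identifier", namespaces=NS)
    return identifier.partition(":")[2]

def section_row(root, path, authorid):
    """Return one flat row for the section at path, tagged with the author ID."""
    row = {"authorid": authorid}
    for child in root.find(path):
        tag = child.tag.rpartition("}")[2]  # drop the "{uri}" part, like local-name()
        row[tag] = (child.text or "").strip()
    return row

root = ET.fromstring(SAMPLE)
aid = author_id(root)  # computed once, reused for every section
coredata = section_row(root, "coredata", aid)
authorname = section_row(root, "author-profile/preferred-name", aid)
print(coredata)
print(authorname)
```

Computing the ID once avoids repeating the identifier lookup for each section, which is the part the R code above repeats seven times.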