
I have the following XML file (the editor would not let me include the root node, so please assume there is one):

<Indvls>
    <Indvl>
        <Info lastNm="HANSON" firstNm="LAURIE"/>
        <CrntEmps>
            <CrntEmp orgNm="ABC INCORPORATED" str1="FOURTY FOUR BRYANT PARK" city="NEW YORK" state="NY" cntry="UNITED STATES" postlCd="10036">
            <BrnchOfLocs>
                <BrnchOfLoc str1="833 NE 55TH ST" city="BELLEVUE" state="WA" cntry="UNITED STATES" postlCd="98004"/>
            </BrnchOfLocs>
            </CrntEmp>
        </CrntEmps>
    </Indvl>
    <Indvl>
        <Info lastNm="JACKSON" firstNm="SHERRY"/>
        <CrntEmps>
            <CrntEmp orgNm="XYZ INCORPORATED" str1="3411 GEORGE STREET" city="SAN FRANCISCO" state="CA" cntry="UNITED STATES" postlCd="94105">
            <BrnchOfLocs>
            </BrnchOfLocs>
            </CrntEmp>
        </CrntEmps>
    </Indvl>
</Indvls>

Using R, I want to extract the following columns into a table: (a) lastNm and firstNm from the Info node, which are always present; (b) orgNm from the CrntEmps/CrntEmp node, which is always present; and (c) str1, city, and state from the CrntEmps/CrntEmp/BrnchOfLocs/BrnchOfLoc node, which may or may not be present (in my example, the second individual does NOT have a branch office address).

My challenge is that many Indvl entries will not have a BrnchOfLoc node. I want a row for every individual even when that node is missing (otherwise the columns have unequal lengths and I get an error when building the data frame).

Any thoughts or suggestions? I appreciate any inputs.

Addendum: Here is my code:

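library(XML)  # provides getNodeSet() and xmlGetAttr()

# Return attribute `attr` from every node matching XPath `xp`, evaluated
# with `n` as the context node, or `default` when nothing matches.
xmlGetNodeAttr <- function(n, xp, attr, default=NA) {
    ns <- getNodeSet(n, xp)
    if (length(ns) < 1) {
        return(default)
    } else {
        sapply(ns, xmlGetAttr, attr, default)
    }
}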

do.call(rbind, lapply(xmlChildren(xmlRoot(doc)), function(x) {
    data.frame(
        fname=xmlGetNodeAttr(x, "//Info","firstNm",NA),
        lname=xmlGetNodeAttr(x, "//Info","lastNm",NA),
        orgname=xmlGetNodeAttr(x, "//CrntEmps/CrntEmp[1]","orgNm",NA),
        zip=xmlGetNodeAttr(x, "//CrntEmps/CrntEmp[1]/BrnchOfLocs/BrnchOfLoc[1]","city",NA)
    )
}))

1 Answer


You should be using XPath expressions that are relative to the current node:

do.call(rbind, lapply(xmlChildren(xmlRoot(doc)), function(x) {
    data.frame(
        fname=xmlGetNodeAttr(x, "./Info","firstNm",NA),
        lname=xmlGetNodeAttr(x, "./Info","lastNm",NA),
        orgname=xmlGetNodeAttr(x, "./CrntEmps/CrntEmp[1]","orgNm",NA),
        zip=xmlGetNodeAttr(x, "./CrntEmps/CrntEmp[1]/BrnchOfLocs/BrnchOfLoc[1]","city",NA)
    )
}))

Note the use of ./ rather than //. An XPath expression that begins with // is evaluated from the document root, so it searches the entire document and ignores the node you are lapply-ing over. An expression that begins with ./ is evaluated relative to the current x node, so it only matches nodes beneath that Indvl. This returns

        fname   lname          orgname      zip
Indvl  LAURIE  HANSON ABC INCORPORATED BELLEVUE
Indvl1 SHERRY JACKSON XYZ INCORPORATED     <NA>
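To see the difference between the two path styles in isolation, here is a minimal sketch using a trimmed copy of the sample data. The names `xml_text`, `indvls`, `get_attr`, and `result` are made up for illustration, and the root node is assumed to be `Indvls`; the extra columns (str1, city, state) follow what the question asked for rather than the code above.

```r
library(XML)  # xmlParse, getNodeSet, xmlGetAttr

# Trimmed copy of the sample data; assumes <Indvls> is the root node.
xml_text <- '<Indvls>
  <Indvl>
    <Info lastNm="HANSON" firstNm="LAURIE"/>
    <CrntEmps>
      <CrntEmp orgNm="ABC INCORPORATED">
        <BrnchOfLocs>
          <BrnchOfLoc str1="833 NE 55TH ST" city="BELLEVUE" state="WA"/>
        </BrnchOfLocs>
      </CrntEmp>
    </CrntEmps>
  </Indvl>
  <Indvl>
    <Info lastNm="JACKSON" firstNm="SHERRY"/>
    <CrntEmps>
      <CrntEmp orgNm="XYZ INCORPORATED">
        <BrnchOfLocs/>
      </CrntEmp>
    </CrntEmps>
  </Indvl>
</Indvls>'

doc <- xmlParse(xml_text, asText = TRUE)
indvls <- getNodeSet(doc, "/Indvls/Indvl")

# "//Info" starts at the document root, so it finds every Info node
n_abs <- length(getNodeSet(indvls[[1]], "//Info"))
# "./Info" starts at the given Indvl, so it finds only that one's Info
n_rel <- length(getNodeSet(indvls[[1]], "./Info"))

# Hypothetical helper: first matching node's attribute, or NA if no match
get_attr <- function(node, xp, attr) {
  ns <- getNodeSet(node, xp)
  if (length(ns) < 1) NA_character_ else xmlGetAttr(ns[[1]], attr, NA_character_)
}

branch <- "./CrntEmps/CrntEmp[1]/BrnchOfLocs/BrnchOfLoc[1]"
rows <- lapply(indvls, function(x) {
  data.frame(
    lastNm  = get_attr(x, "./Info", "lastNm"),
    firstNm = get_attr(x, "./Info", "firstNm"),
    orgNm   = get_attr(x, "./CrntEmps/CrntEmp[1]", "orgNm"),
    str1    = get_attr(x, branch, "str1"),
    city    = get_attr(x, branch, "city"),
    state   = get_attr(x, branch, "state"),
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, rows)
```

Here `n_abs` counts both Info nodes while `n_rel` counts one, and `result` has one row per individual with NA in the branch columns for the individual who has no BrnchOfLoc node.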
MrFlick
  • I am facing situations where the entire BrnchOfLocs node AND the sub-node (BrnchofLoc) is missing. This code is running into issues at that point. Thoughts? I really appreciate your inputs on this issue. Thank you. – user3808860 Aug 17 '14 at 05:35
  • It seems to work exactly the same in that case. What "issues" are you running into? – MrFlick Aug 17 '14 at 05:37
  • Alright - let me try a few variations in my XML file and in my code to see if I still run into the current issue (just shows one row, all with NA values). Thanks again. – user3808860 Aug 17 '14 at 05:39
  • Well, it works fine if I take out `` from the sample above. Make sure your sample data recreates the problem you are having. – MrFlick Aug 17 '14 at 05:40
  • MrFlick - I used a slightly different approach as I had to have a lot more flexibility. I will post my solution a little later. Thanks a lot for your help. – user3808860 Sep 01 '14 at 15:50
  • @user3808860 I think you forgot to post your solution – Hack-R Sep 19 '16 at 09:08