3

I have a folder containing more than 1000 files with the extension (they are no real xml files though).

I want to extract certain contents from these files automatically, so that a matrix or table is the end result (which I can use further in R for analysis, or export to 1 csv file, etc).

I have made/altered a code which works for a single file, but can't get it to work to do it automatically for the rest. By a loop?

So my code for a single file is as follows:

library(xml2)

temp <- read_xml("test.xml")
# get all the <ns2:opendataField>s
recs <- xml_find_all(temp, "//ns2:opendataField")
# extract and clean all the columns
vals <- trimws(xml_text(recs))
#create columns
cols <- xml_attr(xml_find_all(temp, "//ns2:opendataField"), "key")
#create rows
rows <- xml_attr(xml_find_all(temp, "//ns2:opendataField"), "value")
datakvk <- data.frame(cols,rows)

This results in:

 > head(datakvk)
                                              cols       rows
1                                  SbiBusinessCode      18129
2                             DocumentAdoptionDate 2017-08-22
3                                    FinancialYear       2016
4                                     BalanceSheet       <NA>
5 BalanceSheetBeforeAfterAppropriationResultsTitle       <NA>
6      BalanceSheetBeforeAfterAppropriationResults         Na
> 

In the end, with all these 1000s of files, I hope to get something like:

                                              cols       file 1   file 2
1                                  SbiBusinessCode      18129     34234
2                             DocumentAdoptionDate 2017-08-22     452454
3                                    FinancialYear       2016     2016
4                                     BalanceSheet       <NA>     2016
5 BalanceSheetBeforeAfterAppropriationResultsTitle       <NA>     <NA>
6      BalanceSheetBeforeAfterAppropriationResults         Na
> 

I tried the following code, but it didnt work:

list.files(pattern=".xml$") #

# create a list from these files
list.filenames<-list.files(pattern=".xml$")

# create an empty list that will serve as a container to receive the incoming files
list.data<-list()

# create a loop to read in your data
for (i in 1:length(list.filenames))
{
  list.data[[i]]<-read_xml(list.filenames[i])
  recs <- xml_find_all(list.data[[i]], "//ns2:opendataField")
  vals <- trimws(xml_text(recs))
  cols <- xml_attr(xml_find_all(list.data[[i]], "//ns2:opendataField"), "value")
  rows <- xml_attr(xml_find_all(list.data[[i]], "//ns2:opendataField"), "key")
}

# add the names of  data to the list
names(list.data)<-list.filenames

What am I missing? where do I go wrong?

Thanks in advance for helping me....

To be complete: (One single source file (out of 1000s looks like:)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<opendata xmlns:ns2="http://schemas.kvk.nl/xb/query/service/2016/1/0/0">
  <ns2:opendataField key="SbiBusinessCode" value="18129"/>
  <ns2:opendataField key="DocumentAdoptionDate" value="2017-08-22"/>
  <ns2:opendataField key="FinancialYear" value="2016"/>
  <ns2:opendataField key="BalanceSheet">
    <ns2:opendataField key="BalanceSheetBeforeAfterAppropriationResultsTitle">
      <ns2:opendataField key="BalanceSheetBeforeAfterAppropriationResults" value="Na"/>
    </ns2:opendataField>
    <ns2:opendataField key="BalanceSheetTitle">
      <ns2:opendataField key="Assets" value="61296">
        <ns2:opendataField key="AssetsNoncurrent" value="8978">
          <ns2:opendataField key="IntangibleAssets" value="8978"/>
        </ns2:opendataField>
        <ns2:opendataField key="AssetsCurrent" value="52318">
          <ns2:opendataField key="Inventories" value="2239"/>
          <ns2:opendataField key="Receivables" value="40560"/>
          <ns2:opendataField key="CashAndCashEquivalents" value="9519"/>
        </ns2:opendataField>
      </ns2:opendataField>
      <ns2:opendataField key="EquityAndLiabilities" value="61296">
        <ns2:opendataField key="Equity" value="201">
          <ns2:opendataField key="ShareCapital" value="1"/>
          <ns2:opendataField key="ReservesOther" value="200"/>
        </ns2:opendataField>
        <ns2:opendataField key="LiabilitiesCurrent" value="61095"/>
      </ns2:opendataField>
    </ns2:opendataField>
  </ns2:opendataField>
</opendata>
Parfait
  • 104,375
  • 17
  • 94
  • 125
RobertHaa
  • 79
  • 2
  • 10
  • *but are not 'real' xml files* ... from the sample at bottom that is a perfectly valid XML! Please explain. – Parfait Mar 09 '18 at 15:42
  • @Parfait in my previous question i was told it was not an xml: https://stackoverflow.com/questions/49189629/900k-xml-into-r?noredirect=1#comment85386392_49189629 (And I am at like beginner level, so I just say what other said to me) – RobertHaa Mar 09 '18 at 15:45
  • It's a copy/paste issue. You just left off closing tags at the end. Commenter did not specify or check. Often XML posters tend do that and even leave out root tags that have namespaces like yours do! – Parfait Mar 09 '18 at 16:39

2 Answers2

2

Consider converting your for loop into lapply that calls data.frame() for list of dataframes. And because your XML files can potentially have different key/values, a simple cbind off a list of dataframes will not work, so use the chain merge with Reduce(), keeping all rows (i.e., full outer join) .

...
# BUILD DATAFRAME LIST
df_list <- lapply(list.filenames, function(f) {
  doc <- read_xml(f)

  setNames(data.frame(
    xml_attr(xml_find_all(doc, "//ns2:opendataField"), "key"),
    xml_attr(xml_find_all(doc, "//ns2:opendataField"), "value")
  ), c("key", f))

})

# CHAIN MERGE INTO MASTER DATAFRAME
final_df <- Reduce(function(x,y) merge(x, y, by="key", all=TRUE), df_list)
Parfait
  • 104,375
  • 17
  • 94
  • 125
0

You can use lapply and then cbind the list elements with do.call:

library(xml2)
library(dplyr)

files <- list.files(pattern = ".xml$")
data  <- lapply(files, function(x) {
  temp <- read_xml(x) %>% xml_find_all("//ns2:opendataField")
  cols <- xml_attr(xml_find_all(temp, "//ns2:opendataField"), "key")
  rows <- xml_attr(xml_find_all(temp, "//ns2:opendataField"), "value")
  out  <- data.frame(rows, row.names = cols)
  names(out) <- x
  out
})
do.call(cbind, data)

Output for two files with the same content:

                                                  file1.xml  file2.xml
SbiBusinessCode                                       18129      18129
DocumentAdoptionDate                             2017-08-22 2017-08-22
FinancialYear                                          2016       2016
BalanceSheet                                           <NA>       <NA>
BalanceSheetBeforeAfterAppropriationResultsTitle       <NA>       <NA>
BalanceSheetBeforeAfterAppropriationResults              Na         Na
BalanceSheetTitle                                      <NA>       <NA>
Assets                                                61296      61296
AssetsNoncurrent                                       8978       8978
IntangibleAssets                                       8978       8978
AssetsCurrent                                         52318      52318
Inventories                                            2239       2239
Receivables                                           40560      40560
CashAndCashEquivalents                                 9519       9519
EquityAndLiabilities                                  61296      61296
Equity                                                  201        201
ShareCapital                                              1          1
ReservesOther                                           200        200
LiabilitiesCurrent                                    61095      61095
Martin Schmelzer
  • 23,283
  • 6
  • 73
  • 98
  • Thanks Martin, but I get the following error: `Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 19, 16, 23, 21, 12, 22, 24, 14, 15, 18, 17, 20, 31` Could you advise? – RobertHaa Mar 09 '18 at 16:02
  • Yes, your files do not have a homogenous structure. So cb – Martin Schmelzer Mar 10 '18 at 12:12