I'm dealing with huge XML files (75 GB and more), so any small overhead adds up to many minutes, if not hours, of extra processing time.
The core of my code does the following while parsing an XML chunk. Let's say I have a chunk of 3 lines. Note that I only care about the `a`, `b` and `c` attributes, but there may be item elements with missing attributes, e.g.:
library(xml2)  # provides read_xml(), xml_children() and xml_attrs() used below
xmlvec <- c('<item a="1" c="2" x="very long whatever" />',
            '<item b="3" c="4" x="very long whatever" />',
            '<item a="5" b="6" c="7" x="very long whatever" />')
I define a mapping that lists which attributes to look up (that is, the ones I'd like to read) and what to rename them to:
mapping <- c("a", "b", "c")
# this doesn't matter here
#names(mapping) <- c("aa", "bb", "cc")
If I do the following, I get missing values and/or NA column names because of the way the missing attributes affect the binding of the rows. Note the missing `b` column, since the first item element doesn't have it:
df <- as.data.frame(do.call(
  rbind,
  lapply(xml_children(read_xml(paste("<xml>", paste(xmlvec, collapse=""), "</xml>"))),
    function(x) {
      xml_attrs(x)[mapping]
    }
  )
), stringsAsFactors = FALSE)
df
     a   NA c
1    1 <NA> 2
2 <NA>    3 4
3    5    6 7
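The NA name can be reproduced without any XML at all. Here is a minimal sketch, where `attrs` is a hand-written stand-in (my assumption, not real parser output) for the named character vector that `xml_attrs()` returns for the first item element:

```r
# Hypothetical stand-in for xml_attrs() on the first <item>, which has no "b"
attrs <- c(a = "1", c = "2", x = "very long whatever")
mapping <- c("a", "b", "c")

# Subsetting a named vector by a name it lacks yields an NA value
# whose name is also NA
picked <- attrs[mapping]
print(names(picked))
print(unname(picked))
```

As far as I understand, `rbind` takes its column names from the first argument that has names, which would explain why the NA name from the first row ends up as the column name of the whole data frame.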
Since attribute `b` is missing in the first row of this mini chunk, I get an `NA` column which I can't match later to any column name. The first line of any chunk is arbitrary and can have any attributes missing, so I need to enforce the schema while reading each attribute to keep the enclosing data frame from breaking, but this is very expensive performance-wise:
df <- as.data.frame(do.call(
  rbind,
  lapply(xml_children(read_xml(paste("<xml>", paste(xmlvec, collapse=""), "</xml>"))),
    function(x) {
      y <- xml_attrs(x)[mapping]
      # drop the NA-named slots left by attributes this element lacks
      if (any(is.na(names(y)))) {
        y <- y[-which(is.na(names(y)))]
      }
      # re-add the missing attributes as NA under their proper names
      y[setdiff(mapping, names(y))] <- NA
      # reorder to match the order of mapping
      y[order(factor(names(y), levels=mapping))]
    }
  )
), stringsAsFactors = FALSE)
df
     a    b c
1    1 <NA> 2
2 <NA>    3 4
3    5    6 7
Now the column schema and order are enforced, but at a very high performance cost, since this is done on a per-line basis. Is there a better way?