Best way to parse large XML file in R without reading entire file into memory?

Question

I have developed an R script that correctly extracts selected data from small (<2 MB) XML files. This script involves reading the entire file into memory. However, now I am trying to apply this script to a much larger 624-MB XML file, and have encountered the following issues:

~ If I try to run it on my laptop, CPU and memory usage both shoot up to 100%. I was nervous about running the job on this platform, so I killed it.

~ I have tried to run it on the CoCalc cloud computing platform, but I have encountered problems with the R XML parser, so the job doesn’t even start running.

~ I'm not sure whether reading the full file into memory remains a feasible option, or whether I will need to revise my code to handle only much smaller subsets of the full XML file at one time.

I have been researching options that might allow me to make simple alterations to my code to allow the huge file to be read either one line at a time or in chunks, but am unclear about the best option. Some of the descriptions I have seen, e.g. for using SAX processing, seem to suggest that this code would need to be rewritten at a very low level that would not use the XML file’s hierarchical structure, and that would require low-level data-handling functions to be written. I am trying to avoid this.
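
For what it is worth, event-driven parsing in the XML package need not discard the hierarchy entirely. Below is a minimal sketch, assuming the branches argument of xmlEventParse() behaves as described in the package documentation: each completed <item> element is handed to a handler as a small node, so the usual accessors still work within one item, while the file as a whole is never held as a tree. The inline XML string, the element names, and make_item_collector() are hypothetical stand-ins, not taken from the real file.

```r
library(XML)

# Hypothetical miniature input standing in for the real file
xml_input <- "<items><item><name>Alpha</name></item><item><name>Beta</name></item></items>"

# Closure that accumulates one value per <item> as the parser streams past it
make_item_collector <- function() {
  names_seen <- character()
  handle_item <- function(node, ...) {
    # 'node' is the complete <item> subtree for this one item only
    names_seen <<- c(names_seen, xmlValue(node[["name"]]))
  }
  list(item = handle_item, result = function() names_seen)
}

collector <- make_item_collector()

# branches maps an element name to the function called for each such element;
# asText = TRUE is used here only because the input is an inline string
invisible(xmlEventParse(xml_input, asText = TRUE, branches = collector["item"]))

item_names <- collector$result()
print(item_names)
```

For a real file, the first argument would be the file name and asText would be left at its default.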

The most promising option seems to be the xml_siblings() and/or other related functions in the R xml2 package. Ideally, I would like to call one of these functions within a loop, and extract a single node each time this function is called, so that I can process a single node at a time.

However, when I call any of these functions following the syntax given in the documentation, I always get the following error:

# try to extract Node 1 from the xmldata file:
# library(XML)
# library(xml2)
# filename = "SmallTestFile.xml"
# xmldata = xmlRoot(xmlTreeParse(filename))
> TestSiblings <- xml_siblings(xmldata)
Error in UseMethod("nodeset_apply") : 
  no applicable method for 'nodeset_apply' applied to an object of class "c('XMLNode', 'RXMLAbstractNode', 'XMLAbstractNode', 'oldClass')"

I have searched around, but have not yet located a resource that helps me troubleshoot the above error message.
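
A possible reading of the error, hedged: xml_siblings() belongs to xml2 and dispatches on xml2 classes, whereas xmlRoot(xmlTreeParse(...)) returns an object from the XML package (class XMLNode), so the two cannot be mixed. A minimal sketch of calling it on an xml2 object, using a made-up inline document:

```r
library(xml2)

# Made-up document; the element names are illustrative only
doc <- read_xml("<root><item><name>A</name></item><item><name>B</name></item></root>")

# xml_children() returns an xml2 nodeset, which can be indexed one node at a time
items <- xml_children(doc)
first_item <- items[[1]]

# xml_siblings() accepts an xml2 node and returns its sibling nodes
sibs <- xml_siblings(first_item)

first_name <- xml_text(xml_find_first(first_item, "./name"))
print(first_name)
print(length(sibs))
```

Note that read_xml() still parses the whole document, so this speaks only to the error, not to the memory problem.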

I have also received advice that I might want to switch from R to Python, e.g. to use Beautiful Soup. I will do this if necessary, but would strongly prefer to just adjust my existing R code if possible.

Thanks in advance for any guidance you can provide.

library(XML)
library(xml2)
library(gdata)

filename = "HugeFile.xml"

# Save the database file as a tree structure
xmldata = xmlRoot(xmlTreeParse(filename))

# Number of nodes in the entire database file
NumNodes <- xmlSize(xmldata)

# read file into variable
MyData <- read_xml(filename)

# strip out the namespace; this can make the data easier to work with
xml_ns_strip(MyData)

# locate all items [i.e. nodes] within the data set
items <- xml_find_all(MyData, './item');

row_count <- 1

TotalNumberOfSubitems <- length(xml_find_all(items, './subitems/subitem'));

item.name <- array(, dim=c(TotalNumberOfSubitems,1))


# for each item
for (item_num in seq_along(items)) {

  # select the node being processed in this iteration
  current_item <- items[[item_num]]

  # call xml_find_all(), xml_find_first(), and xml_text() functions to extract info;
  # e.g., record the item's name:
  item.name[row_count] <- xml_text(xml_find_first(current_item, './name'));

  …

}

# Create composite matrix that holds all variables being reported
CompositeMatrix = cbind(item.name,value2,value3,value4)

# Specify column names
colnames(CompositeMatrix) <- c("Item Name", "Value 2", "Value 3", "Value 4")

# Write to output file with column headers... BUT these are misaligned with the rest of the columns...
write.fwf(CompositeMatrix,file="OutputList.txt",sep="\t", quote=F, rownames=FALSE, colnames=TRUE)
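
On the misaligned headers: if a tab-delimited file is acceptable in place of a fixed-width one (an assumption about the goal), base R's write.table() is one alternative that writes the header row with the same separator as the data rows. A minimal sketch with a made-up matrix standing in for CompositeMatrix:

```r
# Made-up stand-in for CompositeMatrix; the values are illustrative only
demo_matrix <- cbind(c("first item", "second"), c("1", "22"))
colnames(demo_matrix) <- c("Item Name", "Value 2")

# Write header and rows using the same tab separator
out_file <- tempfile(fileext = ".txt")
write.table(demo_matrix, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)

written <- readLines(out_file)
print(written)
```

Tab-delimited columns will not line up visually in a plain text viewer, but they should import cleanly into a spreadsheet or read.delim().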

I am inclined to think that external pre-processing with XSLT might help: https://stackoverflow.com/questions/44257890/parsing-large-and-complicated-xml-file-to-data-frame?rq=1 — dmi3kno, Jul 29 '17 at 19:22
Thanks for your response. I tested out the xmlToList() function and it works, but I'm not sure that it addresses my central issue, which is the need to access only one node at a time. I am going to continue trying to get xml_siblings() working and also consider getNodeSet() to try to access each node sequentially... — Bob Loblaw, Jul 30 '17 at 00:35
Try command-line tools. For large (>300 MB) .csv files, I have successfully used `awk` to filter the interesting content before reading the file. For large XML files, you might try `xmllint` to extract content. — Paul Rougieux, Jul 30 '17 at 12:35
Thanks for your suggestion, Paul. If I need to run my entire R script as an R Markdown Notebook, would your suggestion to use command-line processing be compatible? As far as I'm aware, having searched through documentation, I have not yet seen that this can be done... — Bob Loblaw, Jul 30 '17 at 15:48
