I have developed an R script that correctly extracts selected data from small (<2 MB) XML files. This script involves reading the entire file into memory. However, now I am trying to apply this script to a much larger 624-MB XML file, and have encountered the following issues:
~ If I run it on my laptop, CPU and memory usage shoot up to 100%; I was nervous about continuing the job on that platform, so I killed it.
~ I have tried to run it on the CoCalc cloud computing platform, but I encountered problems with the R XML parser, so the job never even starts.
~ I'm not sure whether reading the full file into memory remains a feasible option, or whether I will need to revise my code to handle only much smaller subsets of the full XML file at one time.
I have been researching options that might allow me to make simple alterations to my code to allow the huge file to be read either one line at a time or in chunks, but am unclear about the best option. Some of the descriptions I have seen, e.g. for using SAX processing, seem to suggest that this code would need to be rewritten at a very low level that would not use the XML file’s hierarchical structure, and that would require low-level data-handling functions to be written. I am trying to avoid this.
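For context, the XML package's xmlEventParse() offers a middle ground between full-tree parsing and raw SAX: its branches argument hands each complete matching subtree to a handler as an ordinary node, so functions like xmlValue() still work and no low-level character-level handlers are needed. Here is a minimal sketch, assuming the large file consists of repeated <item> elements (a stand-in for the real structure) under a single root:

```r
library(XML)

# Small stand-in file; the real case would pass the path to the huge file
tmp <- tempfile(fileext = ".xml")
writeLines("<root><item><name>A</name></item><item><name>B</name></item></root>", tmp)

names_seen <- character(0)

# Each complete <item> subtree is delivered here as a parsed node,
# so tree-style accessors still work; only one item is held at a time
item_handler <- function(node) {
  names_seen <<- c(names_seen, xmlValue(node[["name"]]))
}

invisible(xmlEventParse(tmp, handlers = list(),
                        branches = list(item = item_handler)))
names_seen  # should now contain one entry per <item>
```

This keeps the per-item extraction code close to what a tree-based script already does, while the parser streams the file.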
The most promising option seems to be xml_siblings() and/or other related functions in the xml2 package. Ideally, I would like to call one of these functions within a loop and extract a single node on each call, so that I can process one node at a time.
However, whenever I call any of these functions (following the syntax in the documentation for each function tested), I get the following error:
# try to extract Node 1 from the xmldata file:
# library(XML)
# library(xml2)
# filename = "SmallTestFile.xml"
# xmldata = xmlRoot(xmlTreeParse(filename))
> TestSiblings <- xml_siblings(xmldata)
Error in UseMethod("nodeset_apply") :
  no applicable method for 'nodeset_apply' applied to an object of class "c('XMLNode', 'RXMLAbstractNode', 'XMLAbstractNode', 'oldClass')"
I have searched around, but not yet located a useful resource to inform the troubleshooting of the above error message.
I have also received advice that I might want to switch from R to Python, e.g. to use Beautiful Soup. I will do this if necessary, but would strongly prefer to just adjust my existing R code if possible.
Thanks in advance for any guidance you can provide.
library(XML)
library(xml2)
library(gdata)
filename = "HugeFile.xml"
# Save the database file as a tree structure
xmldata = xmlRoot(xmlTreeParse(filename))
# Number of nodes in the entire database file
NumNodes <- xmlSize(xmldata)
# read file into variable
MyData <- read_xml(filename)
# strip out the namespace; this can make the data easier to work with
xml_ns_strip(MyData)
# locate all items [i.e. nodes] within the data set
items <- xml_find_all(MyData, './item');
row_count <- 1
TotalNumberOfSubitems <- length(xml_find_all(items, './subitems/subitem'));
item.name <- array(, dim=c(TotalNumberOfSubitems,1))
# for each drug
for (item_num in seq_along(items)) {
  current_item <- items[[item_num]]
  # call xml_find_all(), xml_find_first(), and xml_text() functions to extract info;
  # e.g., record the drug's name:
  item.name[row_count] <- xml_text(xml_find_first(current_item, './name'));
  row_count <- row_count + 1
  …
}
# Create composite matrix that holds all variables being reported
CompositeMatrix = cbind(item.name,value2,value3,value4)
# Specify column names
colnames(CompositeMatrix) <- c("Item Name", "Value 2", "Value 3", "Value 4")
# Write to output file with column headers... BUT these are misaligned with the rest of the columns...
write.fwf(CompositeMatrix,file="OutputList.txt",sep="\t", quote=F, rownames=FALSE, colnames=TRUE)
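On the misaligned headers: since the output is tab-separated anyway, fixed-width formatting via gdata::write.fwf may not be needed at all; base R's write.table() writes the header row with the same separator as the data, so the columns line up. A minimal sketch with a hypothetical stand-in matrix (the real CompositeMatrix comes from the loop above):

```r
# Hypothetical stand-in for the real CompositeMatrix
CompositeMatrix <- cbind(c("alpha", "beta"), c("1", "2"))
colnames(CompositeMatrix) <- c("Item Name", "Value 2")

out <- tempfile(fileext = ".txt")
# Header and data rows share the same tab separator, so they stay aligned
write.table(CompositeMatrix, file = out, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)
```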