
I have a dataset that I'm trying to parse in R. The data comes from HMDB and the dataset is called Serum Metabolites (an XML file). The XML file contains about 25K metabolite nodes, each of which I want to parse into sub-nodes.

I have code that parses the XML file into a list object in R. Since the XML file is quite big, and since for each metabolite there are about 12 sub-nodes I want, parsing takes a long time: about 3 hours per 1,000 metabolites. I'm trying to use the package parallel but receive an error.

The packages:

library("XML")
library("xml2")
library( "magrittr" )  #for pipe operator %>%
library("pbapply") # to track on progress  
library("parallel") 

The function:

# The function receives an XML file (its location) and returns a list of nodes
 
Short_Parser_HMDB <- function(xml.file_location){
  start.time<- Sys.time()
  # Read as xml file
  doc <- read_xml( xml.file_location )
  # get metabolite nodes (only the first 1,000 used in this sample)
  met.nodes <- xml_find_all(doc, ".//d1:metabolite")[1:1000]  # or [(i*1000+1):(i*1000+1000)] for batch i, or [1:3] for a quick test
  # XPath for each desired child node (each will yield a data.frame of values)
  xpath_child.v <- c( "./d1:accession",
                      "./d1:name"  ,
                      "./d1:description",
                      "./d1:synonyms/d1:synonym"  ,
                      "./d1:chemical_formula"   ,
                      "./d1:smiles" ,
                      "./d1:inchikey"    ,
                      "./d1:biological_properties/d1:pathways/d1:pathway/d1:name"   ,
                      "./d1:diseases/d1:disease/d1:name"   ,
                      "./d1:diseases/d1:disease/d1:references",
                      
                      "./d1:kegg_id"   ,                
                      "./d1:meta_cyc_id"
  )
  
  child.names.v <- c( "accession",
                      "name" ,  
                      "description" ,
                      "synonyms"  ,
                      "chemical_formula" , 
                      "smiles" ,
                      "inchikey"  , 
                      "pathways_names" ,
                      "diseases_name",
                      "references",
                      
                      "kegg_id" , 
                      "meta_cyc_id"
  ) 
  # first, loop over the met.nodes
  L.sec_acc <- parLapply(cl, met.nodes, function(x) {  # parLapply for parallel; pblapply would track progress but slows the function down dramatically
    # second, loop over the desired child-node XPaths
    temp <- parLapply(cl, xpath_child.v, function(y) {
      xml_find_all(x, y) %>% xml_text(trim = TRUE) %>% data.frame(value = .)
    })
    #set their names
    names(temp) <- child.names.v
    return(temp)
  }) 
  end.time <- Sys.time()
  total.time <- end.time - start.time
  print(total.time)
  return(L.sec_acc)
    
}

Now create the environment:

# select the location where the XML file is 
location <- "D:/path/to/file//HMDB/DataSets/serum_metabolites/serum_metabolites.xml"


cl <- makeCluster(detectCores(), type = "PSOCK")
clusterExport(cl, c("Short_Parser_HMDB", "cl"))
clusterEvalQ(cl, {
  library("parallel")
  library("magrittr")
  library("XML")
  library("xml2")
})

And execute:

Short_outp <- Short_Parser_HMDB(location)
stopCluster(cl)

The error received:

> Short_outp<-Short_Parser_HMDB(location)
Error in checkForRemoteErrors(val) : 
  one node produced an error: invalid connection

Based on those links, I tried to implement the parallelization:

  1. Parallel Processing in R
  2. How to call global function from the parLapply function?
  3. Error in R parallel:Error in checkForRemoteErrors(val) : 2 nodes produced errors; first error: cannot open the connection

but couldn't find invalid connection mentioned as an error.
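To try to isolate the message, here is a minimal sketch of my guess at what triggers it (my own experiment, not taken from the linked posts): a PSOCK cluster object wraps socket connections that are only valid in the master process, so the copy of `cl` that `clusterExport` ships to each worker cannot be used there, which would match the inner `parLapply(cl, ...)` running on the workers:

# Minimal sketch (my own experiment): exporting a cluster object and then
# using the copy on a worker seems to reproduce the same message.
library(parallel)

cl <- makeCluster(2, type = "PSOCK")
clusterExport(cl, "cl")  # each worker now holds a *copy* of the cluster object
res <- tryCatch(
  clusterEvalQ(cl, parallel::parLapply(cl, 1:2, identity)),  # runs on the workers
  error = function(e) conditionMessage(e)
)
print(res)  # something like: "2 nodes produced errors; first error: invalid connection"
stopCluster(cl)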

I'm using Windows 10 and the latest R version, 4.0.2 (not sure if that's enough information).

Any hint or idea will be appreciated.

TaL
  • First of all, you're not just parallelizing; you are nesting your parallel execution. This is a bad idea unless you're very careful, so one loop (likely the inner loop) should be changed into a sequential loop. I'd expect a lot of errors from exporting the nodes of a parallel cluster into another cluster. – Oliver Aug 24 '20 at 13:15
  • Thank you @Oliver - so I tried to parallelize only the outer apply (keeping it parLapply) and changed the inner one to a regular lapply, but now I'm getting a different error: Error in checkForRemoteErrors(val) : 16 nodes produced errors; first error: object 'xpath_child.v' of mode 'function' was not found. Could you suggest some other approach to handle the time it takes the code to run? – TaL Aug 25 '20 at 07:46
  • The error is similar to a problem asked [here](https://stackoverflow.com/q/51030181/10782538). It would seem that something is calling `xpath_child.v` as a function in your code, while it is a variable. Did you place this in the spot of `FUN` in `lapply`? – Oliver Aug 25 '20 at 08:10
  • You were right, I changed the inner loop from parLapply to lapply but didn't remove the `cl` variable. Thank you! However, now there is a new error: Error in checkForRemoteErrors(val) : 16 nodes produced errors; first error: external pointer is not valid – TaL Aug 25 '20 at 08:59
  • Now that one is harder to diagnose at a glance. From the sound of it, I would guess that it has something to do with how the object (my guess is `met.nodes`) is stored on the package side. Looking into it, [this question](https://stackoverflow.com/questions/55810140/parallel-processing-xml-nodes-with-r) seems to suggest this as well. I would suggest looking for other questions (and solutions) about parallel implementations on xml/xml2 objects in R, as my knowledge sadly is running out at this point. – Oliver Aug 25 '20 at 09:14
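
Update (after the comments above): a sketch of what I'm trying now (my own attempt, not yet verified on the full file). The outer loop is the only parallel one, the plain variables the workers need are exported explicitly, and each metabolite node is shipped as a serialized string rather than as an xml2 node, since xml2 nodes are external pointers that cannot cross process boundaries:

library(xml2)
library(magrittr)
library(parallel)

doc       <- read_xml(location)
met.nodes <- xml_find_all(doc, ".//d1:metabolite")[1:1000]

# xml2 nodes are external pointers into the C-level document, so they cannot
# be sent to workers; serialize each node to a plain string instead.
# (This assumes as.character() keeps the xmlns declaration on each fragment --
# worth checking on a single node first.)
met.strings <- vapply(met.nodes, as.character, character(1))

# abbreviated here; use the full xpath_child.v / child.names.v from the function above
xpath_child.v <- c("./d1:accession", "./d1:name")
child.names.v <- c("accession", "name")

cl <- makeCluster(detectCores(), type = "PSOCK")
clusterEvalQ(cl, { library("xml2"); library("magrittr") })
clusterExport(cl, c("xpath_child.v", "child.names.v"))

L.sec_acc <- parLapply(cl, met.strings, function(s) {
  node <- read_xml(s)  # re-create the node inside the worker
  # the inner loop stays sequential, as suggested in the comments
  temp <- lapply(xpath_child.v, function(y) {
    xml_find_all(node, y) %>% xml_text(trim = TRUE) %>% data.frame(value = .)
  })
  names(temp) <- child.names.v
  temp
})
stopCluster(cl)

If the namespace does get dropped on re-parsing, the d1: prefixes in the XPaths would have to be adjusted accordingly.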
