2

I have inherited a function to run a fuzzy match between two sets of names using the stringdist package to calculate the distance between two string variables and select the match with the smallest distance.

This is fine and wonderful and works great for my needs except I have a dataframe of 200,000 candidate names to run against a reference set of 1.35 million names. At ~6 seconds per name, this is going to get out of hand.

I have access to a cluster, so as I see it the fastest way to crack the nut is to batch process it: split the candidate names into say 50 subsets of 4000 names and run them in parallel. Ideally I would write a parallel batch script from the ground up, but I have no experience and time is limited.

Is there any way to take the existing script, put a wrapper around it that subsets the input, feeds it to the script, and then runs them in parallel? Even if it returns 50 separate files I would be happy to stitch them together myself.

The existing code is:

    library(stringdist)
    # set up function
    fuzzymatch<-function(dat1,dat2,string1,string2,meth,id1,id2){
  #initialize Variables:
  matchfile <-NULL #iterate appends
  x<-nrow(dat1) #count number of rows in input, for max number of runs
  
  #Check to see if function has ID values. Allows for empty values for ID variables, simple list match
  if(missing(id1)){id1=NULL}
  if(missing(id2)){id2=NULL}
  
#Loop through dat1 dataset iteratively. This is a work around to allow for large datasets to be matched
  #Can run as long as dat2 dataset fits in memory. Avoids full Cartesian join.
  for(i in 1:x) {
    d<-merge(dat1[i,c(string1,id1), drop=FALSE],dat2[,c(string2,id2), drop=FALSE])#drop=FALSE to preserve 1var dataframe
    
    #Calculate String Distatnce based method specified "meth"
    d$dist <- stringdist(d[,string1],d[,string2], method=meth, nthread=4)
    
    #dedupes A_names selects on the smallest distatnce.
    d<- d[order(d[,string1], d$dist, decreasing = FALSE),]
    d<- d[!duplicated(d[,string1]),]
    
    #append demos on matched file
    matchfile <- rbind(matchfile,d)
    #print(paste(round(i/x*100,2),"% complete",sep=''))
  }
  return(matchfile)
}

Matched <- fuzzymatch(unmatched.names,Master.Taxonomy,"Submitted_Name","scientificName",meth="jw")
head(example)

Toy data:

    unmatched.names <- data.frame(
'Submitted_Name' = c(
  'Phragmites australis (Cav.) Steud.',
  'Molinia caerulea Moench',
  'Potentilla erecta (L.) Räuschel',
  'Cistus ladanifer subsp. ladanifer',
  'Cirsium palustre Scop.',
  'Dryopteris filix-mas (L.) Schott.',
  'Glyceria fluitans R.Br.',
  'Luzula campestris DC.',
  'Rosmarinus officinalis de Noë ex Lange',
  'Cerastium fontanum subsp. vulgare (Hartm.) Greuter & Burdet'))
   Master.Taxonomy <- data.frame(
  'scientificName' = c(
    'Phragmites australis (Cav.) Trin. ex Steud.',
    'Molinia caerulea (L.) Moench',
    'Potentilla erecta (L.) Raeusch.',
    'Cistus ladanifer subsp. africanus Dans.',
    'Cirsium pygmaeum Scop.',
    'Dryopteris filix-mas (L.) Schott',
    'Glyceria fluitans (L.) R.Br.',
    'Luzula campestris (L.) DC.',
    'Rosmarinus officinalis L.',
    'Cerastium fontanum subsp. vulgare (Hartman) Greuter & Burdet'))
    
    ```

halfer
  • 19,824
  • 17
  • 99
  • 186

0 Answers0