Pair wise binary comparison - optimizing code in R

Question

I have a file that represents the gene structure of bacteria models. Each row represents a model. A row is a fixed length binary string of which genes are present (1 for present and 0 for absent). My task is to compare the gene sequence for each pair of models and get a score of how similar they are and computer a dissimilarity matrix.

In total there are 450 models (rows) in one file and there are 250 files. I have a working code however it takes roughly 1.6 hours to do the whole thing for only one file.

#Sample Data    
Generation: 0
    [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
    [1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
    [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
    [0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
    [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

What my code does:

Reads the file
Convert the binary string into a data frame Gene, Model_1, Model_2, Model_3, … Model_450
Run a nested for loop to do the pair-wise comparison (only the top half of the matrix) – I take the two corresponding columns and add them, then count the positions where the sum is 2 (meaning present in both models)
Write the data to a file
Create the matrix later

comparison code

generationFiles = list.files(pattern = "^Generation.*\\_\\d+.txt$")

start.time = Sys.time()

for(a in 1:length(generationFiles)){

  fname = generationFiles[a]

  geneData = read.table(generationFiles[a], sep = "\n", header = T, stringsAsFactors = F)

  geneCount = str_count(geneData[1,1],"[1|0]")

  geneDF <- data.frame(Gene = paste0("Gene_", c(1:geneCount)), stringsAsFactors = F)

  #convert the string into a data frame
      for(i in 1:nrow(geneData)){

    #remove the square brackets
    dataRow = substring(geneData[i,1], 2, nchar(geneData[i,1]) - 1)

    #removing white spaces
    dataRow = gsub(" ", "", dataRow, fixed = T)

    #splitting the string 
    dataRow = strsplit(dataRow, ",")

    #converting to numeric
    dataRow = as.numeric(unlist(dataRow))

    colName = paste("M_",i,sep = "")
    geneDF <- cbind(geneDF, dataRow)
    colnames(geneDF)[colnames(geneDF) == 'dataRow'] <- colName

    dataRow <- NULL
  }

  summaryDF <- data.frame(Model1 = character(), Model2 = character(), Common = integer(),
                          Uncommon = integer(), Absent = integer(), stringsAsFactors = F)

  modelNames = paste0("M_",c(1:450))

  secondaryLevel = modelNames

  fileName = paste0("D://BellosData//GC_3//Summary//",substr(fname, 1, nchar(fname) - 4),"_Summary.txt")

  for(x in 1:449){

    secondaryLevel = secondaryLevel[-1]

    for(y in 1:length(secondaryLevel)){

      result = geneDF[modelNames[x]] + geneDF[secondaryLevel[y]]

      summaryDF <- rbind(summaryDF, data.frame(Model1 = modelNames[x],
                                               Model2 = secondaryLevel[y],
                                               Common = sum(result == 2),
                                               Uncommon = sum(result == 1),
                                               Absent = sum(result == 0)))

    }


  }

  write.table(summaryDF, fileName, sep = ",", quote = F, row.names = F)
  geneDF <- NULL
  summaryDF <- NULL
  geneData <-NULL

}

converting to matrix

maxNum = max(summaryDF$Common)
  normalizeData = summaryDF[,c(1:3)]
  normalizeData[c('Common')] <- lapply(normalizeData[c('Common')], function(x) 1 - x/maxNum)

  normalizeData[1:2] <- lapply(normalizeData[1:2], factor, levels=unique(unlist(normalizeData[1:2]))) 

  distMatrixN = xtabs(Common~Model1+Model2, data=normalizeData)

  distMatrixN = distMatrixN + t(distMatrixN)

Is there a way to make the process run faster? Is there a more efficient way to do the comparison?

Just curious... For the data above, what do you expect as the results?? — Onyambu, Apr 19 '18 at 21:23
I get a data frame with three columns; model 1, model 2, similarity score. later I would use it to cluster the models. — SriniShine, Apr 19 '18 at 22:49

Vlo · Accepted Answer · 2018-04-19T21:14:07.840

This code should be faster. Nested loops are nightmare slow in R. Operations like rbind-ing one row at a time is also among the worst and slowest ideas in R programming.

Generate 450 rows with 20 elements of 0, 1 on each row.

M = do.call(rbind, replicate(450, sample(0:1, 20, replace = T), simplify = F))

Generate list of combination(450, 2) numbers of row pairs

L = split(v<-t(utils::combn(450, 2)), seq(nrow(v))); rm(v)

Apply whatever comparison function you want. In this case, the number of 1's at the same position for each row combinations. If you want to calculate different metrics, just write another function(x) where M[x[1],] is the first row and M[x[2],] is the second row.

O = lapply(L, function(x) sum(M[x[1],]&M[x[2],]))

Code takes ~4 seconds a fairly slow 2.6 Ghz Sandy Bridge

Get a clean data.frame with your results, three columns : row 1, row 2, metric between the two rows

data.frame(row1 = sapply(L, `[`, 1),
           row2 = sapply(L, `[`, 2),
           similarity_metric = do.call(rbind, O))

To be honest, I didn't thoroughly comb through your code to replicate exactly what you were doing. If this is not what you are looking for (or can't be modified to achieve what you are looking for), leave a comment.

Woah! That removes a lot of code. Thank you very much. And it is a lot faster too. — SriniShine, Apr 19 '18 at 22:46

Pair wise binary comparison - optimizing code in R

comparison code

1 Answers1