Huge data file and running multiple parameters and memory issue, Fisher's test

Question

I have a R code that I am trying to run in a server. But it is stopping in the middle/get frozen probably because of memory limitation. The data files are huge/massive (one has 20 million lines) and if you look at the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data has specific data of human and rats' genes and human has 36 replicates, while rat has 281. So, the loop is basically 281*36 steps. What I want to do is to process data using the function getGeneType and see how different/independent are the expression of different replicate combinations. Using Fisher's test. The data rat_processed_7_25_FDR_05.out looks like this :

2       Sptbn1  114201107       114200202       chr14|Sptbn1:114201107|Sptbn1:114200202|reg|-   2       Thymus_M_GSM1328751     reg
2       Ndufb7  35680273        35683909        chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+     2       Thymus_M_GSM1328751     rev
2       Ndufb10 13906408        13906289        chr10|Ndufb10:13906408|Ndufb10:13906289|reg|-   2       Thymus_M_GSM1328751     reg
3       Cdc14b  1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|-       3       Thymus_M_GSM1328751     reg

and the data fetal_output_7_2.out has the form

SPTLC2  78018438        77987924        chr14|SPTLC2:78018438|SPTLC2:77987924|reg|-     11      Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg
EXOSC1  99202993        99201016        chr10|EXOSC1:99202993|EXOSC1:99201016|rev|-     5       Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg
SHMT2   57627893        57628016        chr12|SHMT2:57627893|SHMT2:57628016|reg|+       8       Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg
ZNF510  99538281        99537128        chr9|ZNF510:99538281|ZNF510:99537128|reg|-      8       Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg
PPFIBP1 27820253        27824363        chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+   10      Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg

Now I have few questions on how to make this more efficient. I think when I run this code, R takes up lots of memory that ultimately causes problems. I am wondering if there is any way of doing this more efficiently

Another possibility is the usage of double for-loop'. Will sapply help? In that case, how should I apply sapply?

At the end I want to convert result into a csv file. I know this is a bit overwhelming to put code like this. But any optimization/efficient coding/programming will be A LOT! I really need to run the whole thing at least one to get the data soon.

#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]

#function to capitalize
capitalize <- function(x){
  capital <- toupper(x) ## capitalize 
  paste0(capital)
}

#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))

#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)

print("done splitting")
#HyRy :when some gene has only reg, rev , REG, REV
#HnRy : when some gene has only reg,REG,REV
#HyRn : add 1 when some gene has only reg,rev,REG
#HnRn : add 1 when some gene has only reg,REG

#function to be used to aggregate 
getGeneType <- function(types) {
  types <- as.character(types)
  if ('rev' %in% types) {
    return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
  } 
  else {
    return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
  }
}

#logical function to see whether x is integer(0) ..It's used the for loop bellow in case any one HmYn is equal to zero
is.integer0 <- function(x) {
  is.integer(x) && length(x) == 0L
}

result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis", 
                     Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
  for(j in 1:length(humanSplit)) {
    ratReplicateName <- names(ratSplit[i])
    humanReplicateName <- names(humanSplit[j])

    #merging above two based on the one-to-one gene mapping as in geneList defined above.
    mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
    mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")

    mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
    mergedRatData <- mergedRatData[,c(2,1,4,5)]  #rearrange column
    mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"

    agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
    HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.

    #now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
    #is.integer0 function is used
    HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
    HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
    HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
    HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])

    contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
    # contingencyTable: 
    # HnRn --|--HyRn
    # |------|-----|
    # HnRy --|-- HyRy
    #

    fisherTest <- fisher.test(contingencyTable)

    #make new line out of the result of fisherTest
    newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
                              alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2], 
                              oddratio = fisherTest$estimate[[1]])))

    result <-rbind(result,newLine) #append newline to result
    if(j%%10 = 0) print(c(i,j))
  }
}

write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")

I highly recommending exploring the data.table package in R if you are dealing with huge data sets and running into memory issues. — rbatt, Dec 07 '15 at 14:31

score 1 · Accepted Answer · edited May 23 '17 at 12:15

Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().

If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from the outside to the inside of the loop, to replace the rbind(). It would just be necessary to create a new file name for the CSV file that is written from each output, e.g. by:

csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)

If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.

While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.

Huge data file and running multiple parameters and memory issue, Fisher's test

1 Answers1