0

I am trying to perform enrichment analysis using phyper function from R. The code I have written is giving me the accurate results but it takes forever when the size of a matrix gets increased. Below is the reproducible example for 621 * 1860 matrix. But when the size of matrix is increased to 6210 X 24000, it takes me almost a day to complete even when I run it on multiple cores. I am wondering if there is a way to optimize the same.

Please download the three RObjects from the shareable link present in the comments.

## Main Functions
GetEnrichedAnnotations <- function(DrugDiseaseName,
                               DrugDiseaseGeneMatrix,
                               DrugDiseaseFeatureMatrix,
                               DFDrgDis){ ## Function Begins
  TotalGenesCount = nrow(DrugDiseaseGeneMatrix)
  ## Get the assosciated Genes for each Drug or Disease
  DrugDiseaseGenes = GetGeneList(DrugDiseaseName,DFDrgDis)
  ## Get the only annotations that Genes from the Drug or Disease List 
  UPDNAnnotations = DrugDiseaseFeatureMatrix[DrugDiseaseName,]
  UPDNAnnotations = UPDNAnnotations[UPDNAnnotations > 0]
  ## First value to the HyperGeometricFunction phyper
  GenesFromInput = DrugDiseaseFeatureMatrix[DrugDiseaseName,names(UPDNAnnotations)]
  ## Second value to the HyperGeometricFunction phyper
  GenesinAnnotation = DrugDiseaseGeneMatrix[,names(UPDNAnnotations)]
  GenesinAnnotation = apply(GenesinAnnotation,2,sum)
  ## Third Value  to the HyperGeometricFunction phyper
  TotalGenes = rep(TotalGenesCount,length(GenesFromInput))
  RemainingGenes = TotalGenes - GenesinAnnotation
  ## Fourth value to the HyperGeometricFunction phyper
  NumberOfGenesInDrug = rep(length(DrugDiseaseGenes),length(GenesFromInput))
  names(NumberOfGenesInDrug) = names(GenesFromInput)
  ## Apply Enrichment ANalysis
  PValues = phyper(GenesFromInput-1,GenesinAnnotation,RemainingGenes,NumberOfGenesInDrug,lower.tail = FALSE)
  AdjustedPvalues = p.adjust(PValues,method = "BH")
  EnrichedAnnotations = AdjustedPvalues[AdjustedPvalues <= 0.05]
  ### When P value is zero, replacing zeros with the minimum value
  EnrichedAnnotations[EnrichedAnnotations == 0] = 2.2e-16
  EnrichedAnnotations = EnrichedAnnotations[EnrichedAnnotations <= 0.05]
  ## Get the log value for the adjusted Pvalues
  EnrichedAnnotations = -log(EnrichedAnnotations,2)
  ## This vector consists of all the annotations including Enriched Annotations
  TotalAnnotaionsVector = rep(0,ncol(DrugDiseaseGeneMatrix))
  names(TotalAnnotaionsVector) = colnames(DrugDiseaseGeneMatrix)
  TotalAnnotaionsVector[names(EnrichedAnnotations)] = EnrichedAnnotations
  return(TotalAnnotaionsVector)
  }##Function Ends


## Get GeneList for a given Diseases
GetGeneList = function(DiseaseName,DFDrgDis){
  GeneList = DFDrgDis[DFDrgDis$DrugName == DiseaseName,"Symbol"]
  GeneList = unlist(strsplit(GeneList,","))
  GeneList = trimws(GeneList)
  GeneList = unique(GeneList)
  return(GeneList)
}


## Parraleize the Code
numberofCores = parallel::detectCores() - 1
### Closing all the Existing Connections
closeAllConnections()
### Making a Cluster  with 8 of 8 available cores
Cluster <- parallel::makeCluster(numberofCores)
### Register the Cluster
doParallel::registerDoParallel(Cluster)





## Please download the RObject from below link
## Please download the three objects for reproducible example
## https://drive.google.com/drive/folders/0Bz9Y4BgZAF7oS2dtVVEwN0Z1Tnc?usp=sharing

GeneAnnotations = readRDS("Desktop/StackOverFlow/GeneAnnotations.rds")
DiseaseAnnotations =  readRDS("Desktop/StackOverFlow/DiseaseAnnotations.rds") 
Diseases =  readRDS("Desktop/StackOverFlow/Diseases.rds") 
## Get the Unique Names of Disease List
DisNames = row.names(DiseaseAnnotations)

## Below Function runs the code on parallel to get the Enriched Annotations for Multiple Drugs or Diseases
## Get the Enriched Annotaions for all the Diseases UP Regulated EnricR Genes
library(foreach)
EnrichedAnnotations <- foreach(i=1:length(DisNames), .export= c('GetGeneList'),.packages = "Matrix") %dopar% {
  GetEnrichedAnnotations(DisNames[i],GeneAnnotations,DiseaseAnnotations,Diseases)
}

## Convert to Matrix
EnrichedAnnotations <- do.call("cbind",EnrichedAnnotations)
EnrichedAnnotations = t(EnrichedAnnotations)

colnames(EnrichedAnnotations) = colnames(EnrichedAnnotations)
rownames(EnrichedAnnotations) = DisNames


## Stop the Cluster
parallel::stopCluster(Cluster)
F. Privé
  • 11,423
  • 2
  • 27
  • 78
Jason Mathews
  • 265
  • 1
  • 3
  • 13
  • Using TopGO is no go in this case? – Damiano Fantini Aug 25 '17 at 19:14
  • TopGO takes more time than the above script to generate the enriched matrices. – Jason Mathews Aug 25 '17 at 20:20
  • My feeling is that you could speed things up (a little) if you modify the custom function so that it keeps data in memory and iterates through the diseases you want to test. Briefly, instead of passing 1 DrugDiseaseName at a time, you pass a vector and the function iterates... I'll look into this later, cause I think it's a very interesting question. – Damiano Fantini Aug 25 '17 at 20:25
  • Thanks. I will try with the approach you suggested and will keep this post updated in case I see any improvement in performance. I am sure, the function I have written can be optimized. Will wait for fellow coders to reply to this. – Jason Mathews Aug 25 '17 at 20:31
  • You could replace `DFDrgDis` with a `data.table` that has two columns: `DiseaseName` and `Symbol`. Each unique value of `DiseaseName` would have one row for each unique value of `Symbol`. This way, you're using `data.table` for fast index lookup, and don't have to call `strsplit`, `unlist`, `trimws`, and `unique` for every lookup. Also, as @Damiano mentioned, vectorize this logic. – Nathan Werth Aug 25 '17 at 20:39

1 Answers1

0

If you quickly profile your SEQUENTIAL version, you see that almost all time is taken by these two lines:

GenesinAnnotation = DrugDiseaseGeneMatrix[,names(UPDNAnnotations)]
GenesinAnnotation = apply(GenesinAnnotation,2,sum)

You can precompute the colSums (so that you don't compute multiple times the same thing) and you actually don't need to subset the data (you could subset the result).

So, you just need to replace them by:

GenesinAnnotation <- GenesinAnnotation0[names(UPDNAnnotations)]

And precompute

GenesinAnnotation0 <- colSums(DrugDiseaseGeneMatrix)
F. Privé
  • 11,423
  • 2
  • 27
  • 78