I am having trouble sampling within a given background, or while excluding some possibilities.
I am trying to create an R function that shuffles genomic regions.
For now the function works well and follows these steps:
- Retrieves the length and chromosome of each genomic region in the query.
- Calculates the maximum possible start of each region as the total size of its chromosome minus the length of that region.
- Builds the shuffled genomic regions by sampling each start between 1 and its maximum possible start, keeping the width of the corresponding query region.
This function uses GenomicRanges objects; here is its code:
GrShuffle <- function(regions, chromSizes = LoadChromSizes("hg19")) {
  # Gets all the regions lengths from the query.
  regionsLength <- regions@ranges@width
  # The possible starts are the chromosome sizes - the regions lengths.
  possibleStarts <- chromSizes[as.vector(regions@seqnames), ] - regionsLength
  # Gets all the random starts from sampling the possible starts.
  randomStarts <- unlist(lapply(possibleStarts, sample.int, size = 1))
  granges <- GRanges(regions@seqnames,
                     IRanges(start = randomStarts, width = regionsLength),
                     strand = regions@strand)
  return(granges)
}
But now I need to use a universe, i.e. another set of regions that determines where the random regions may be placed. The universe acts as a restriction on the sampling: it is a set of regions in the same form as the query, and no shuffled region should fall outside it.
Any clue on how to sample within ranges in R?
Keeping the lapply is important, as it drastically reduces the execution time of the function compared to using a loop.
[EDIT]
Here is a reproducible example that does not use GenomicRanges, to simplify what I want to achieve as much as possible.
## GENERATES A RANDOM QUERY
chromSizes <- c(100,200,250)
names(chromSizes) <- c("1","2","3")
queryChrom <- sample(names(chromSizes), 100, replace = TRUE)
queryLengths <- sample(10, 100, replace = TRUE)
queryPossibleStarts <- chromSizes[queryChrom] - queryLengths
queryStarts <- unlist(lapply(queryPossibleStarts, sample.int, size = 1))
query <- data.frame(queryChrom, queryStarts, queryStarts + queryLengths)
colnames(query) <- c("chrom", "start", "end")
##
## SIMPLIFIED FUNCTION
# Gets all the regions lengths from the query.
regionsLength <- query$end - query$start
# The possible starts are the chromosome sizes - the regions lengths.
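# as.character() guards against query$chrom being a factor (the data.frame default before R 4.0).
possibleStarts <- chromSizes[as.character(query$chrom)] - regionsLength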
# Gets all the random starts from sampling the possible starts.
randomStarts <- unlist(lapply(possibleStarts, sample.int, size = 1))
shuffledQuery <- data.frame(queryChrom, randomStarts, randomStarts + queryLengths)
colnames(shuffledQuery) <- c("chrom", "start", "end")
##
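To make the universe restriction concrete on this simplified layout, here is a minimal base-R sketch of one possible approach, not a definitive implementation. The `universe` data frame, its coordinates, and the helper `sampleStartInUniverse` are hypothetical names introduced only for illustration. It assumes the same convention as above (end = start + length) and that each shuffled region must fit entirely inside a single universe interval on its own chromosome.

```r
## SKETCH: SAMPLING STARTS INSIDE A UNIVERSE (hypothetical names, base R only)
set.seed(1)

# Hypothetical universe: allowed intervals on each chromosome.
universe <- data.frame(chrom = c("1", "1", "2", "3"),
                       start = c(10, 60, 20, 100),
                       end   = c(40, 90, 120, 240),
                       stringsAsFactors = FALSE)

# Draws one start so that [start, start + len] lies inside one universe interval.
# Returns NA when the region fits in no interval of that chromosome.
sampleStartInUniverse <- function(chrom, len, universe) {
  u <- universe[universe$chrom == chrom, ]
  # Number of valid starts offered by each interval (0 if the region does not fit).
  nStarts <- pmax(u$end - len - u$start + 1, 0)
  if (sum(nStarts) == 0) return(NA_integer_)
  # Pick an interval with probability proportional to its number of valid starts,
  # then pick a start uniformly inside it, so every valid start is equally likely.
  i <- sample.int(nrow(u), size = 1, prob = nStarts)
  u$start[i] + sample.int(nStarts[i], size = 1) - 1
}

# Small query: one chromosome and one length per region.
queryChrom <- c("1", "2", "3", "1")
queryLengths <- c(5, 10, 8, 3)

randomStarts <- mapply(sampleStartInUniverse, queryChrom, queryLengths,
                       MoreArgs = list(universe = universe), USE.NAMES = FALSE)
shuffledQuery <- data.frame(chrom = queryChrom,
                            start = randomStarts,
                            end = randomStarts + queryLengths,
                            stringsAsFactors = FALSE)
```

This sketch calls the helper once per region through mapply, which keeps it close to the lapply version above. On a large query, precomputing the valid starts per chromosome and length would probably be faster.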