I am having trouble sampling within a given background, or while excluding some possibilities.

I am trying to write an R function that shuffles genomic regions.

For now the function works well and follows these steps:

  1. Retrieves the length and chromosome of every genomic region in the query.
  2. Calculates the largest possible start for each region: the total size of its chromosome minus the length of the region.
  3. Builds each shuffled region by sampling a start between 1 and that largest possible start; the width is simply the width of the corresponding query region.

This function uses GenomicRanges objects; here is its code:

library(GenomicRanges)

GrShuffle <- function(regions, chromSizes = LoadChromSizes("hg19")) {
    # Gets all the region lengths from the query.
    regionsLength <- width(regions)
    # The possible starts are the chromosome sizes minus the region lengths.
    possibleStarts <- chromSizes[as.character(seqnames(regions)), ] - regionsLength
    # Samples one random start per region, between 1 and its possible start.
    randomStarts <- unlist(lapply(possibleStarts, sample.int, size = 1))
    granges <- GRanges(seqnames(regions),
                       IRanges(start = randomStarts, width = regionsLength),
                       strand = strand(regions))
    return(granges)
}

But now I need to use a universe, i.e. another set of regions that determines the ranges in which the random regions may be placed. The universe works like a restriction on sampling. It will be another set of regions, like the query, and no shuffled region should fall outside of those regions.
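
To make the restriction concrete, here is a minimal base R sketch of one possible approach, not a tested GenomicRanges solution. It assumes the universe is a plain data frame with chrom, start and end columns holding non-overlapping intervals; the function name shuffleInUniverse and the toy values are hypothetical. For each query region it counts the valid starts in every universe interval on the same chromosome, draws one position among all of them, and maps it back to its interval, so every valid placement is equally likely.

```r
# Hypothetical helper: one shuffled region per query region, kept inside the universe.
# Assumes `universe` has columns chrom, start, end with non-overlapping intervals.
shuffleInUniverse <- function(chrom, len, universe) {
    universe$chrom <- as.character(universe$chrom)
    newStart <- vapply(seq_along(chrom), function(i) {
        u <- universe[universe$chrom == chrom[i], , drop = FALSE]
        # Number of valid starts in each interval (0 when the interval is too short).
        nStarts <- pmax(u$end - len[i] - u$start + 1, 0)
        total <- sum(nStarts)
        # No interval can hold this region: return NA instead of failing.
        if (total == 0) return(NA_real_)
        # Draw one position among all valid starts, then find its interval.
        k <- sample.int(total, 1)
        upper <- cumsum(nStarts)
        j <- which(k <= upper)[1]
        # Offset of the drawn position inside interval j.
        u$start[j] + (k - (upper[j] - nStarts[j])) - 1
    }, numeric(1))
    data.frame(chrom = chrom, start = newStart, end = newStart + len,
               stringsAsFactors = FALSE)
}

# Toy universe and query (made-up values, for illustration only).
universe <- data.frame(chrom = c("1", "1", "2"),
                       start = c(10, 60, 1),
                       end   = c(30, 90, 50),
                       stringsAsFactors = FALSE)
set.seed(1)
chrom <- c("1", "2", "1", "3")
len <- c(5, 10, 25, 5)
shuffled <- shuffleInUniverse(chrom, len, universe)
```

In this toy input the third region (length 25) only fits in the second interval of chromosome "1", and the fourth region gets NA because chromosome "3" has no universe interval. Whether NA, an error, or dropping the region is the right behaviour in that case is a design choice left open here.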

Any clue on how to sample within ranges in R?

The lapply is important as it drastically reduces the execution time of the function compared to using a loop.
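
As a side note, the per-element sample.int call could probably be replaced by a single vectorized draw. The sketch below is an assumption on my side and has not been benchmarked against the lapply version; the possibleStarts values are made up for illustration. It scales uniform random numbers by each upper bound, which yields one integer between 1 and possibleStarts[i] per region.

```r
set.seed(42)
# Hypothetical upper bounds, one per query region.
possibleStarts <- c(90, 195, 240, 99)
# runif() returns values strictly between 0 and 1, so floor(u * n) lies in 0..(n - 1);
# adding 1 shifts the result to 1..n, the same range as sample.int(n, 1).
randomStarts <- floor(runif(length(possibleStarts)) * possibleStarts) + 1
```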

[EDIT]

Here is a reproducible example that does not use GenomicRanges, to simplify as much as possible what I want to achieve.

## GENERATES A RANDOM QUERY
chromSizes <- c(100,200,250)
names(chromSizes) <- c("1","2","3")
queryChrom <- sample(names(chromSizes), 100, replace = TRUE)
queryLengths <- sample(10, 100, replace = TRUE)
queryPossibleStarts <- chromSizes[queryChrom] - queryLengths
queryStarts <- unlist(lapply(queryPossibleStarts, sample.int, size = 1))
query <- data.frame(queryChrom, queryStarts, queryStarts + queryLengths)
colnames(query) <- c("chrom", "start", "end")
##

## SIMPLIFIED FUNCTION
# Gets all the regions lengths from the query.
regionsLength <- query$end - query$start
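# The possible starts are the chromosome sizes minus the region lengths.
# as.character() guards against query$chrom being a factor, which would index by level code instead of by name.
possibleStarts <- chromSizes[as.character(query$chrom)] - regionsLength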
# Gets all the random starts from sampling the possible starts.
randomStarts <- unlist(lapply(possibleStarts, sample.int, size = 1))
shuffledQuery <- data.frame(query$chrom, randomStarts, randomStarts + regionsLength)
colnames(shuffledQuery) <- c("chrom", "start", "end")
##
  • Can you add a reproducible example? To me it is also not clear what "range" means... – Christoph Apr 12 '17 at 08:57
  • For me it is totally unclear what your problem is actually. What is your "universe, i.e. an other set of regions" in your example? – Roman Apr 12 '17 at 09:35
  • The universe works like a restriction to sampling. It will be another set of regions like the query. And no shuffling should take place outside of those regions. – Zacharie Ménétrier Apr 12 '17 at 09:38
