
I am trying to simulate data via bootstrapping in order to create confidence bands for a funnel plot of my real data. I am building on the strategy from the accepted answer to a previous question. Instead of using a single probability distribution to simulate the data, I want to use different probability distributions depending on the part of the data being simulated.

My problem is writing the appropriate R code to do this more complicated form of data simulation. I would greatly appreciate help answering the question, or help phrasing it more clearly.

The current code is:

n <- 1e4
set.seed(42)
sims <- sapply(1:80, 
               function(k) 
                 rowSums(
                   replicate(k, sample((1:7)/10, n, TRUE, ps))) / k)

This code simulates data points whose values are the mean of between 1 and 80 observations. For example, when a data point is the mean of 10 observations (k = 10), it randomly samples 10 values (each of which can be 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 or 0.7) according to a probability distribution ps, which gives the probability of each value (based on the entire empirical distribution).

ps looks like this:

ps <- prop.table(table((DF$mean_score)[DF$total_number_snps == 1]))
#        0.1         0.2         0.3         0.4         0.5         0.6         0.7 
#0.582089552 0.194029851 0.124378109 0.059701493 0.029850746 0.004975124 0.004975124 

E.g. the probability that the value of an observation is 0.1 is 0.582089552.
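For reference, the sampling step can be reproduced standalone; this is just the code above with ps hard-coded from the table (and n reduced so the sketch runs quickly):

```r
# Original sampling step, self-contained; ps hard-coded from the table above.
ps <- c(0.582089552, 0.194029851, 0.124378109, 0.059701493,
        0.029850746, 0.004975124, 0.004975124)
names(ps) <- (1:7)/10

n <- 100   # reduced from 1e4 so this runs quickly
k <- 10    # each simulated data point is the mean of k observations
set.seed(42)

# replicate() gives an n-by-k matrix of sampled observations;
# rowSums()/k turns each row into the mean of its k observations.
means_k10 <- rowSums(replicate(k, sample((1:7)/10, n, TRUE, ps))) / k
```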

Now, instead of using one frequency distribution for all simulations, I would like to use different frequency distributions depending on the number of observations underlying each data point.

I made a table, cond_probs, with a row for each of my real data points: one column gives the total number of observations, and the remaining columns give the frequency of each value for that data point.

Example of the cond_probs table:

gene_name   0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 total
A1  0.664   0.319   0.018   0.000   0.000   0.000   0.000   0.000   0.000   113.000
A2  0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   1.000

So for the data point A2, there is only 1 observation, which has a value of 0.1. Therefore the frequency of the 0.1 observations is 1. For A1, there are 113 observations and the majority of those (0.664) have the value 0.1. The idea is that cond_probs is like ps, but cond_probs has a probability distribution for each data point rather than one for all the data.

I would like to modify the code above so that sampling uses cond_probs instead of ps for the frequency distribution, with the number of observations, k, as the criterion for choosing which row of cond_probs to sample from. It would work like this:

For data points based on k observations:

look in the cond_probs table and randomly select a row whose total number of observations is similar in size to k (between 0.9k and 1.1k). If no such row exists, skip this value of k.

Once a row is selected, use the probability distribution from that row of cond_probs, just as ps is used in the original code, to randomly sample k observations and output their mean.

For each of the n iterations of replicate, randomly sample with replacement a new row from cond_probs, out of all rows whose value of total is similar to the current value of k (0.9k-1.1k).
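In R terms, the row-selection step might be sketched as follows (assuming cond_probs is a numeric matrix with a "total" column; pick_distribution is a hypothetical helper name, not code from the answer):

```r
# Sketch: given k observations, find all rows of cond_probs whose "total"
# lies within 10% of k, then pick one of those rows at random and return
# its probability distribution (all columns except "total").
pick_distribution <- function(cond_probs, k, tol = 0.1) {
  candidates <- which(abs(cond_probs[, "total"] - k) / k <= tol)
  if (length(candidates) == 0) return(NULL)  # no similar row: skip this k
  # index via candidates[sample(length(...), 1)]: calling sample(x, 1) on a
  # single number x would sample from 1:x instead of returning x itself
  chosen <- candidates[sample(length(candidates), 1)]
  cond_probs[chosen, colnames(cond_probs) != "total"]
}
```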

The idea is that for this dataset one should condition which probability distribution to use based on the number of observations underlying a data point. This is because in this dataset the probability of an observation is influenced by the number of observations (genes with more SNPs tend to have a lower score per observation due to genetic linkage and background selection).

UPDATE USING ANSWER BELOW:

I tried the answer below; it works for the simulated cond_probs data in the example, but not for my real cond_probs file. I imported my cond_probs file and converted it to a matrix with

cond_probs <- read.table("cond_probs.txt", header = TRUE, check.names = FALSE)
cond_probs <- as.matrix(cond_probs)

and the first nine rows (out of ~20,000) look like this:

>cond_probs
       total   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
[1,]     109 0.404 0.174 0.064 0.183 0.165 0.009 0.000 0.000 0.000 0.000
[2,]     181 0.564 0.221 0.144 0.066 0.006 0.000 0.000 0.000 0.000 0.000
[3,]     289 0.388 0.166 0.118 0.114 0.090 0.093 0.028 0.003 0.000 0.000
[4,]     388 0.601 0.214 0.139 0.039 0.008 0.000 0.000 0.000 0.000 0.000
[5,]     133 0.541 0.331 0.113 0.000 0.008 0.008 0.000 0.000 0.000 0.000
[6,]     221 0.525 0.376 0.068 0.032 0.000 0.000 0.000 0.000 0.000 0.000
[7,]     147 0.517 0.190 0.150 0.054 0.034 0.048 0.007 0.000 0.000 0.000
[8,]     107 0.458 0.196 0.252 0.084 0.009 0.000 0.000 0.000 0.000 0.000
[9,]      13 0.846 0.154 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

If I run:

sampleSize <- 20
set.seed(42)
# 1:50 here; for the full run, replace with 1:(max number of SNPs per gene)
sims_test <- sapply( 1:50, simulateData, sampleSize )

and look at the means from the sampling with x observations, I get only a single result when there should be 20.

for example:

> sims_test[[31]]
[1] 0.1

And sims_test is not ordered in the same way as sims:

>sims_test
   [,1] [,2]      [,3]  [,4] [,5]      [,6]      [,7]   [,8]      [,9]
 [1,]  0.1  0.1 0.1666667 0.200 0.14 0.2666667 0.2000000 0.2375 0.1888889
 [2,]  0.1  0.1 0.1333333 0.200 0.14 0.2333333 0.1571429 0.2625 0.1222222
 [3,]  0.1  0.1 0.3333333 0.225 0.14 0.1833333 0.2285714 0.2125 0.1555556
 [4,]  0.1  0.1 0.2666667 0.250 0.10 0.1500000 0.2000000 0.2625 0.2777778
 [5,]  0.1  0.1 0.3000000 0.200 0.16 0.2000000 0.2428571 0.1750 0.1000000
 [6,]  0.1  0.1 0.3666667 0.250 0.16 0.1666667 0.2142857 0.2500 0.2000000
 [7,]  0.1  0.1 0.4000000 0.300 0.12 0.2166667 0.1857143 0.2375 0.1666667
 [8,]  0.1  0.1 0.4000000 0.250 0.10 0.2500000 0.2714286 0.2375 0.2888889
 [9,]  0.1  0.1 0.1333333 0.300 0.14 0.1666667 0.1714286 0.2750 0.2888889

UPDATE 2

Using cond_probs <- head(cond_probs, n), I have determined that the code works up to n = 517; for all larger sizes it produces the same output as above. I am not sure whether this is an issue with the file itself or a memory issue. If I remove line 518 and instead duplicate the preceding lines several times to make a larger file, it works, suggesting that the line itself is causing the problem. Line 518 looks like this:

9.000   0.889   0.000   0.000   0.000   0.111   0.000   0.000   0.000   0.000   0.000

I found four more offending lines:

9.000   0.444   0.333   0.111   0.111   0.000   0.000   0.000   0.000   0.000   0.000

9.000   0.444   0.333   0.111   0.111   0.000   0.000   0.000   0.000   0.000   0.000

9.000   0.111   0.222   0.222   0.111   0.111   0.222   0.000   0.000   0.000   0.000

9.000   0.667   0.111   0.000   0.000   0.000   0.222   0.000   0.000   0.000   0.000

I don't notice anything unusual about them; they all have a 'total' of 9 sites. If I remove these lines and run a 'cond_probs' file containing only the lines BEFORE them, the code works. But there must be other problematic lines, because the entire 'cond_probs' file still doesn't work.

I tried putting these problematic lines back into a smaller 'cond_probs' file, and that file works, so I am very confused: the lines do not seem to be inherently problematic. On the other hand, the fact that they all have 9 total sites suggests some kind of causative pattern.

I would be happy to share the entire file privately if that helps as I don't know what to do next for troubleshooting.

One further issue is that I'm not sure the code is working as expected. I made a dummy cond_probs file with two data points that each have a 'total' of 1 observation:

total   0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
1.000   0.000   0.000   0.000   0.000   0.000   1.000   0.000   0.000   0.000   0.000
1.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000

So I would expect both to be sampled for data points with 1 observation, and therefore roughly 50% of the simulated means to be 0.2 and 50% to be 0.6. However, the mean is always 0.2:

sims_test[[1]]
 [1] 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2

Even if I sample 10,000 times, all observations are 0.2 and never 0.6. My understanding of the code is that it should randomly select a new row of similar size from cond_probs for each observation, but in this case it seems not to. Do I misunderstand the code, or is there still a problem with my input?

The entire cond_probs file can be found at the following address:

cond_probs

UPDATE 3

Changing sapply to lapply when running the simulations fixed this issue.
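The difference matters because sapply() simplifies its result to a matrix only when every element has the same length; when some calls return NA (length 1) and others return vectors, the shape of the result changes with the data. lapply() always returns a list, so [[k]] indexing is stable. A toy illustration (f is just a stand-in function, not part of my data):

```r
# sapply() simplifies to a matrix when all results share one length;
# lapply() always returns a list, so indexing with [[k]] is stable.
f <- function(k) rep(k / 10, 3)   # stand-in: always returns length 3

m <- sapply(1:4, f)   # 3-by-4 matrix: m[[k]] now walks single cells
l <- lapply(1:4, f)   # list of 4 vectors: l[[k]] is f(k)'s full result
```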

Another reason I think leaving cond_probs as it is, and choosing a distribution sampleSize times, might be the best solution: the probability of choosing a distribution should be related to its frequency in cond_probs. If we combine distributions, the odds of picking a distribution with total 9 or 10 would no longer depend on the number of rows with those totals. Example: if there are 90 distributions with total = 10 and 10 with total = 9, there should be a 90% chance of choosing a distribution with total = 10. If we combine distributions, wouldn't the odds become 50/50 for choosing a distribution with total = 9 versus 10 (which would not be ideal)?
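A quick numerical check of the 90/10 example (toy totals, not my real data): drawing one row uniformly at random preserves the frequency weighting automatically.

```r
# 90 rows with total = 10 and 10 rows with total = 9: picking one row
# uniformly at random gives roughly a 90% chance of a total-10
# distribution, which is the weighting I want to preserve.
totals <- c(rep(10, 90), rep(9, 10))

set.seed(1)
draws <- sample(totals, 1e4, replace = TRUE)
prop_total10 <- mean(draws == 10)   # close to 0.9
```

Pre-combining the distributions per total would reproduce this only if each row were weighted by its count when merging, rather than averaged per total.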

user964689
  • I would encourage you to look up [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping) and [conditional probability](https://en.wikipedia.org/wiki/Conditional_probability). – alexwhitworth Nov 06 '15 at 21:38
  • It's not clear to me exactly how you wish to solve your problem or if your proposed solution is reasonable. You appear to be confusing the **observed frequencies** (conditional or otherwise) in your data with the **probability distribution** of your data... perhaps this is a reasonable approach via a bootstrap, or perhaps it's not and you should be looking at the conditional posterior distribution with an informative / non-informative prior.... I'm personally just not clear based on your description – alexwhitworth Nov 06 '15 at 21:41
  • thanks for the constructive comments. You're right that I was being sloppy with the terminology and this is essentially a bootstrapping approach. I will try and correct that. Though I remain confused regarding the difference between a probability distribution and the observed frequencies in the empirical distribution. If I want to resample from the empirical distribution with a probability based on the observed frequency distribution, do the frequency distribution and the probability distribution not refer to the same thing? – user964689 Nov 06 '15 at 21:58
  • I would be happy to discuss further in a private chat – user964689 Nov 06 '15 at 22:04
  • Consider P(Y | X = x*) ~ N(μ, σ) (or some f(θ)). You observe y_1, ..., y_10 from such a distribution. Bootstrap samples from y_1, ..., y_10 are clearly not the same thing as sampling from N(μ, σ), or from a posterior based on a prior N(μ, σ) and the likelihood observed from y_1, ..., y_10. That is my point. Again, perhaps the bootstrap is fine for your problem, 'the frequency distribution and the probability distribution [do] not refer to the same thing'. – alexwhitworth Nov 06 '15 at 22:32

1 Answer


I simply wrote a function ps that chooses an appropriate distribution from cond_probs:

N <- 10  # The sampled values are 0.1, 0.2, ... , N/10
M <- 8   # number of distributions in "cond_probs"

#-------------------------------------------------------------------
# Example data:

set.seed(1)

cond_probs <- matrix(0,M,N)

for(i in 1:nrow(cond_probs)){ cond_probs[i,] <- dnorm((1:N)/M,i/M,0.01*N) }

total <- sort( sample(1:80,nrow(cond_probs)) )
cond_probs <- cbind( total, cond_probs/rowSums(cond_probs) )

colnames(cond_probs) <- c( "total", paste("P",1:N,sep="") )

#---------------------------------------------------------------------
# A function that chooses an appropriate distribution from "cond_probs",
# depending on the number of observations "numObs":

ps <- function( numObs,
                similarityLimit = 0.1 )
{
  similar <- which( abs(cond_probs[,"total"] - numObs) / numObs < similarityLimit )

  if ( length(similar) == 0 )
  { 
    return(NA)
  }
  else
  {
    return( cond_probs[similar[sample(1:length(similar),1)],-1] )
  }
}

#-----------------------------------------------------------------
# A function that simulates data using a distribution that is
# appropriate to the number of observations, if possible:

simulateData <- function( numObs, sampleSize )
{
  if (any(is.na(ps(numObs))))
  {
    return (NA)
  }
  else
  {
    return( rowSums(
              replicate(
                numObs,
                replicate( sampleSize, sample((1:N)/10, 1, prob = ps(numObs))))
            ) / numObs )
  }
}

#-----------------------------------------------------------------
# Test:

sampleSize <- 30
set.seed(42)
sims <- lapply( 1:80, simulateData, sampleSize )

The distributions in cond_probs:

    total           P1           P2           P3           P4           P5           P6           P7           P8           P9          P10
[1,]    16 6.654875e-01 3.046824e-01 2.923948e-02 5.881753e-04 2.480041e-06 2.191926e-09 4.060763e-13 1.576900e-17 1.283559e-22 2.189990e-28
[2,]    22 2.335299e-01 5.100762e-01 2.335299e-01 2.241119e-02 4.508188e-04 1.900877e-06 1.680045e-09 3.112453e-13 1.208647e-17 9.838095e-23
[3,]    30 2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02 4.409369e-04 1.859210e-06 1.643219e-09 3.044228e-13 1.182153e-17
[4,]    45 4.407425e-04 2.191027e-02 2.283103e-01 4.986755e-01 2.283103e-01 2.191027e-02 4.407425e-04 1.858391e-06 1.642495e-09 3.042886e-13
[5,]    49 1.858387e-06 4.407417e-04 2.191023e-02 2.283099e-01 4.986746e-01 2.283099e-01 2.191023e-02 4.407417e-04 1.858387e-06 1.642492e-09
[6,]    68 1.642492e-09 1.858387e-06 4.407417e-04 2.191023e-02 2.283099e-01 4.986746e-01 2.283099e-01 2.191023e-02 4.407417e-04 1.858387e-06
[7,]    70 3.042886e-13 1.642495e-09 1.858391e-06 4.407425e-04 2.191027e-02 2.283103e-01 4.986755e-01 2.283103e-01 2.191027e-02 4.407425e-04
[8,]    77 1.182153e-17 3.044228e-13 1.643219e-09 1.859210e-06 4.409369e-04 2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02

The means of the distributions:

> cond_probs[,-1] %*% (1:10)/10
          [,1]
[1,] 0.1364936
[2,] 0.2046182
[3,] 0.3001330
[4,] 0.4000007
[5,] 0.5000000
[6,] 0.6000000
[7,] 0.6999993
[8,] 0.7998670

Means of the simulated data for 31 observations:

> sims[[31]]
 [1] 0.2838710 0.3000000 0.2935484 0.3193548 0.3064516 0.2903226 0.3096774 0.2741935 0.3161290 0.3193548 0.3032258 0.2967742 0.2903226 0.3032258 0.2967742
[16] 0.3129032 0.2967742 0.2806452 0.3129032 0.3032258 0.2935484 0.2935484 0.2903226 0.3096774 0.3161290 0.2741935 0.3161290 0.3193548 0.2935484 0.3032258

The appropriate distribution is the third one:

> ps(31)
          P1           P2           P3           P4           P5           P6           P7           P8           P9          P10 
2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02 4.409369e-04 1.859210e-06 1.643219e-09 3.044228e-13 1.182153e-17 
mra68
  • thanks so much. I will give this a try and let you know. – user964689 Nov 04 '15 at 21:21
  • Your answer works perfectly for me with your cond_probs file but not when I apply it to my actual cond_probs file. I updated the answer to show what happens when I try it. Do you know why my file might yield different results? Thanks again for your help, I think it must just be a minor file format issue that is stopping it from working. – user964689 Nov 06 '15 at 19:26
  • The structure of `cond_probs` is exactly as wanted: `total` must be the first column. I wasn't able to reproduce the error and enhanced my answer with the result of the test using the `cond_probs` matrix in the question. As expected, the resulting `sims_test` is a list of length 50. But the `sims_test` shown in the question looks like a 9-by-9 matrix. I suggest that you copy and paste the script in my answer and see if you get the same result. If so, read `cond_probs` from the file again and replace it by `head(cond_probs,n)` with increasing numbers n. n=9 is the example above. – mra68 Nov 07 '15 at 01:29
  • Thanks for your comments and helpful advice. I followed your suggestions and found that it stops working at line 518. If I remove this line from the first 600 it works, but still the whole 'cond_probs' file doesn't work, so I assume there are other exceptions. I added the offending line to the question. Any idea what is causing the problem? – user964689 Nov 07 '15 at 11:53
  • I found more offending lines (see updated answer) but when I put these lines back into a smaller 'cond_probs' file it doesn't stop it working, so I am unsure what the underlying problem is. Do you have any idea what it might be? I would be happy to send you the entire file (21813 lines) if that would help (it's too large to paste into the question). – user964689 Nov 07 '15 at 12:20
  • The issue with `total`=1 can be explained by the order in which we choose the distribution and take the sample. Until now we first choose a distribution and then take a sample using the chosen distribution. Since `nObs` is 1, this happens only once; hence the one and only sample comes from ONE of the two possible distributions. Of course we could choose a possible distribution `sampleSize` times, take a sample of size 1 each time, and then combine the sampled values into a sample of size `sampleSize`. But an easier way to achieve this is to combine all distributions for a given `sampleSize`. – mra68 Nov 07 '15 at 17:48
  • thanks. Actually choosing a possible distribution 'sampleSize' times is exactly what I am after. Is your easier suggestion to make a 'cond_probs' file that already combines the distributions for all observations with a given 'total'? I'm not sure that would work as I also want to sample distributions from observations with a similar 'total'. But I may have misunderstood you. – user964689 Nov 07 '15 at 18:01
  • If one of the distributions for a given "total" is similar enough then all of them are similar enough. Hence we can combine them in advance. – mra68 Nov 07 '15 at 18:09
  • Ok so you are only suggesting combining distributions which have the same 'total'? That makes sense. That would be more effective than choosing a distribution 'samplesize' times? – user964689 Nov 07 '15 at 18:14
  • I added a link to the entire 'cond_probs' file, hosted by a file uploading service as you suggested. – user964689 Nov 07 '15 at 18:31
  • I've replaced `sapply` by `lapply`, Now the result is always a list, even if the whole "cond_probs" file is used. I will make the combination of the distributions later. – mra68 Nov 08 '15 at 01:17
  • Thanks so much that's fantastic 'lapply' works. For combining the distributions I am not sure if it will work like choosing distributions a 'sampleSize' number of times. Example: when sampling for distributions with '10' observations we want to sample from distributions with a 'total' of '9', '10' and '11'. Choosing distributions a 'sampleSize' amount of times achieves this. The number of distributions chosen should depend on sampleSize and be independent of 'nObs' and be the same across 'nObs'. – user964689 Nov 08 '15 at 10:04
  • Just added an example to the end of the question to try and explain why I think choosing distributions a sampleSize amount of times would behave preferably and differently to a combining the distributions in cond_probs. Let me know if I just misunderstood you. Thanks very much. – user964689 Nov 08 '15 at 10:13
  • I've edited my answer. Now the distribution is chosen `sampleSize` times. – mra68 Nov 08 '15 at 13:45
  • Thank you so much I am extremely grateful for this, you have helped me enormously. – user964689 Nov 08 '15 at 14:18