
The problem:

I have been sampling 5 categories over 6 months, along with their effects on a certain environmental activity. Over those months, their proportions have varied like this:

| Month|         A|         B|         C|         D|         E|
|-----:|---------:|---------:|---------:|---------:|---------:|
|     1| 0.6666667| 0.3012821| 0.0320513| 0.0000000| 0.0000000|
|     2| 0.5603448| 0.1494253| 0.1408046| 0.1235632| 0.0258621|
|     3| 0.1962843| 0.0961228| 0.3400646| 0.2285945| 0.1389338|
|     4| 0.1135647| 0.0368034| 0.4090431| 0.2954784| 0.1451104|
|     5| 0.0799087| 0.0182648| 0.3812785| 0.3835616| 0.1369863|
|     6| 0.0854701| 0.0085470| 0.3760684| 0.4316239| 0.0982906|

As you can see, A and B have gone down while C, D and E have gone up, with the following correlations:

|          A|          B|          C|          D|          E|
|----------:|----------:|----------:|----------:|----------:|
|  1.0000000|  0.9402901| -0.9885358| -0.9437185| -0.9358701|
|  0.9402901|  1.0000000| -0.9511070| -0.9612210| -0.8413999|
| -0.9885358| -0.9511070|  1.0000000|  0.9139291|  0.9559101|
| -0.9437185| -0.9612210|  0.9139291|  1.0000000|  0.7789632| 
| -0.9358701| -0.8413999|  0.9559101|  0.7789632|  1.0000000|
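For anyone who wants to reproduce the correlation table, here is a minimal sketch. It assumes the table came from the default Pearson `cor()` applied to the monthly proportions, which the post does not state explicitly. The names `props`, `row_totals` and `cor_check` are introduced here for illustration only.

```r
# Monthly proportions copied from the table above (rows = months 1 to 6).
props <- matrix(
  c(0.6666667, 0.3012821, 0.0320513, 0.0000000, 0.0000000,
    0.5603448, 0.1494253, 0.1408046, 0.1235632, 0.0258621,
    0.1962843, 0.0961228, 0.3400646, 0.2285945, 0.1389338,
    0.1135647, 0.0368034, 0.4090431, 0.2954784, 0.1451104,
    0.0799087, 0.0182648, 0.3812785, 0.3835616, 0.1369863,
    0.0854701, 0.0085470, 0.3760684, 0.4316239, 0.0982906),
  ncol = 5, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C", "D", "E"))
)

# Each month should sum to 1, up to the rounding in the table.
row_totals <- rowSums(props)

# Pearson correlation between the five columns across the six months.
cor_check <- cor(props)
round(cor_check, 4)
```

Note that with only six rows, each pairwise correlation is estimated from six points, so the target matrix itself carries a lot of uncertainty.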

The correlation matrix is given by:

Cor_By_Month <- structure(c(1, 0.940290075149674, -0.988535776442558, -0.943718544223924, 
            -0.935870083299231, 0.940290075149674, 1, -0.951106988627249, 
            -0.961220998780756, -0.841399937722727, -0.988535776442558, -0.951106988627249, 
            1, 0.913929137201831, 0.955910074676834, -0.943718544223924, 
            -0.961220998780756, 0.913929137201831, 1, 0.778963196453952, 
            -0.935870083299231, -0.841399937722727, 0.955910074676834, 0.778963196453952, 
            1), .Dim = c(5L, 5L), .Dimnames = list(NULL, c("A", "B", "C", "D", "E")))

I want to graph the response curves of my models, but instead of varying A from 0 to 1 while keeping the other classes at their means, I want all proportions to add up to 1 and to have the proper correlation values.

Expected solution

A data frame with at least 100 samples, where all the categories (A to E) vary from 0 to 1 with several intermediate values, every row adds up to 1, and the correlations between variables stay at the values given by Cor_By_Month.

What I have tried:

Using mvrnorm from MASS

I know this is not the best way of dealing with this, since this is not necessarily normal data, but it is so far the only way I have found to do this:

So, knowing that the mean values of my 5 classes are:

Means <- c(0.283706542309262, 0.101740888487065, 0.279885087917025, 0.243803624143928, 
0.0908638571427198)

And that the correlation is given by Cor_By_Month

I tried:

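library(MASS)  # provides mvrnorm()

# Sigma is the correlation matrix, so every variable is given unit variance
out <- as.data.frame(mvrnorm(1000, mu = Means,
                               Sigma = Cor_By_Month,
                               empirical = TRUE))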

But of course the values go all over the place and do not stay within 0 to 1, despite having the required correlations. To try to correct that, I scaled each column by its min and max values:

  mins <- apply(out, 2, min)
  maxs <- apply(out, 2, max)
  out <- scale(out, center = mins, scale = maxs - mins)

So now I have fixed one of my two problems: all the values of A to E are between 0 and 1, but the rows of my data frame sum to values well over 1.

To fix this I tried the following:

library(dplyr)

out <- as.data.frame(mvrnorm(1000, mu = runif(n = 5),
                               Sigma = Cor_By_Month,
                               empirical = FALSE))
# Min-max scale each column to [0, 1]
mins <- apply(out, 2, min)
maxs <- apply(out, 2, max)
out <- scale(out, center = mins, scale = maxs - mins) %>%
  as.data.frame() %>%
  rowwise() %>%
  mutate(Total = sum(c_across(V1:V5))) %>%
  # Divide each value by its row total so that rows sum to 1
  mutate(across(V1:V5, ~ .x / Total)) %>%
  # Recompute the total as a check (should now be 1)
  mutate(Total = sum(c_across(V1:V5))) %>%
  ungroup() %>%
  as.data.frame()

Now everything adds up to 1 row-wise, but it is rare for any proportion to exceed 0.5; I have tried drawing 300000 samples and got no value over 0.54.
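One way to see where things go wrong is to measure how far each rescaling step moves the correlations away from the target. The sketch below is a diagnostic only, not a solution. It uses `Means` and `empirical = TRUE` from the first attempt rather than `runif()`, and the names `raw`, `scaled`, `closed`, `drift_scaled` and `drift_closed` are introduced here for illustration. Min-max scaling is a positive linear map applied per column, so Pearson correlations should survive it; dividing by the row total is not, so it can shift them.

```r
library(MASS)  # for mvrnorm()

set.seed(1)

# Correlation matrix and means exactly as posted in the question.
Cor_By_Month <- structure(c(1, 0.940290075149674, -0.988535776442558, -0.943718544223924,
            -0.935870083299231, 0.940290075149674, 1, -0.951106988627249,
            -0.961220998780756, -0.841399937722727, -0.988535776442558, -0.951106988627249,
            1, 0.913929137201831, 0.955910074676834, -0.943718544223924,
            -0.961220998780756, 0.913929137201831, 1, 0.778963196453952,
            -0.935870083299231, -0.841399937722727, 0.955910074676834, 0.778963196453952,
            1), .Dim = c(5L, 5L), .Dimnames = list(NULL, c("A", "B", "C", "D", "E")))
Means <- c(0.283706542309262, 0.101740888487065, 0.279885087917025,
           0.243803624143928, 0.0908638571427198)

# Draws whose sample correlation matches Cor_By_Month (empirical = TRUE).
raw <- mvrnorm(1000, mu = Means, Sigma = Cor_By_Month, empirical = TRUE)

# Step 1: min-max scale each column to [0, 1].
mins <- apply(raw, 2, min)
maxs <- apply(raw, 2, max)
scaled <- scale(raw, center = mins, scale = maxs - mins)

# Step 2: divide every row by its total so that rows sum to 1.
closed <- scaled / rowSums(scaled)

# Largest absolute change in any pairwise correlation after each step.
drift_scaled <- max(abs(cor(scaled) - Cor_By_Month))
drift_closed <- max(abs(cor(closed) - Cor_By_Month))

c(drift_scaled = drift_scaled, drift_closed = drift_closed)
max(closed)  # largest proportion produced by this procedure
```

A likely reason for the ceiling near 0.5: for A to approach 1 after row-normalisation, B would have to be near 0 in the same row, but A and B are correlated at about 0.94, so a high A tends to come with a high B (and likewise within C, D and E).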

I am sure there are better solutions to what I am trying to do.

Derek Corcoran
  • One approach would be to use `rmultinom` to generate multiple samples for each month using the proportions for that month and then combine one sample from each month. That would approximate your correlation matrix since the samples would approximate the observed proportions for each month, but the correlation matrix would not be used in constructing the samples. – dcarlson Sep 01 '20 at 04:52
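The `rmultinom` idea from the comment could be sketched roughly as follows. This is one reading of the suggestion, not a tested answer. `n_per_month` and `size` are hypothetical values chosen for illustration; in particular, the post does not give the number of observations behind each month's proportions, so `size = 300` is an assumption.

```r
set.seed(42)

# Monthly proportions copied from the table in the question.
props <- matrix(
  c(0.6666667, 0.3012821, 0.0320513, 0.0000000, 0.0000000,
    0.5603448, 0.1494253, 0.1408046, 0.1235632, 0.0258621,
    0.1962843, 0.0961228, 0.3400646, 0.2285945, 0.1389338,
    0.1135647, 0.0368034, 0.4090431, 0.2954784, 0.1451104,
    0.0799087, 0.0182648, 0.3812785, 0.3835616, 0.1369863,
    0.0854701, 0.0085470, 0.3760684, 0.4316239, 0.0982906),
  ncol = 5, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C", "D", "E"))
)

n_per_month <- 20   # hypothetical: simulated samples per month (6 * 20 = 120 rows)
size        <- 300  # hypothetical: observations per simulated sample

# For each month, draw multinomial counts with that month's proportions
# and convert them back to proportions, so every row sums to 1.
sim <- do.call(rbind, lapply(seq_len(nrow(props)), function(m) {
  counts <- rmultinom(n_per_month, size = size, prob = props[m, ])  # 5 x n_per_month
  data.frame(Month = m, t(counts) / size)
}))

# Compare the simulated correlations with the target matrix by eye.
round(cor(sim[, c("A", "B", "C", "D", "E")]), 3)
```

As the comment says, the correlation matrix is not used to build the samples, so it is only approximated. The simulated values also stay close to the observed monthly proportions, so this would not cover the full 0 to 1 range asked for in the question.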
