r2dtable contingency tables are too concentrated

Question

I am using R's r2dtable function to generate contingency tables with given marginals. However, when I inspect the resulting tables, the values look too concentrated around the midpoint. Example:

set.seed(1)
matrices <- r2dtable(1e4, c(100, 100), c(100, 100))
vec.vals <- vapply(matrices, function(x) x[1, 1], numeric(1))

> table(vec.vals)
vec.vals
  36   37   38   39   40   41   42   43   44   45   46   47   48   49   50   51 
   1    1    1    7   25   49  105  182  268  440  596  719  954 1072 1152 1048 
  52   53   54   55   56   57   58   59   60   61   62 
1022  775  573  404  290  156   83   50   19    6    2

So across 10,000 simulations the smallest upper-left corner value is 36 and the largest is 62.

Is there a way to achieve somewhat less concentrated matrices?

score 2 · Answer 1 · answered Aug 25 '17 at 17:25

Bear in mind that any given random draw is extremely unlikely to have an upper-left corner value of 35, so 1e4 draws may not be enough to observe such an event. Look at the theoretical predictions (courtesy of P. Dalgaard on the R-help list this morning):

 round(dhyper(0:100,100,100,100)*1e4)
  [1]    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
 [18]    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
 [35]    0    0    0    1    4    9   21   45   88  160  269  417  596  787  959 1081 1124
 [52] 1081  959  787  596  417  269  160   88   45   21    9    4    1    0    0    0    0
 [69]    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
 [86]    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
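
To put the simulated counts next to these predictions, something like the following sketch could be used. It assumes the same marginals as in the question; the names `n_draws`, `x11`, `observed`, `expected` and `comparison` are made up for illustration.

```r
set.seed(1)
n_draws <- 1e4
matrices <- r2dtable(n_draws, c(100, 100), c(100, 100))
x11 <- vapply(matrices, function(m) m[1, 1], numeric(1))

# Observed counts for every possible value 0..100 (zeros included)
observed <- tabulate(x11 + 1, nbins = 101)

# Expected counts if the upper-left cell is hypergeometric
expected <- dhyper(0:100, 100, 100, 100) * n_draws

comparison <- data.frame(value = 0:100,
                         observed = observed,
                         expected = round(expected))

# Show only the values that were observed or are expected at least once
comparison[comparison$observed > 0 | comparison$expected >= 1, ]
```

If the two columns agree up to sampling noise, the concentration is a property of the distribution itself rather than of r2dtable.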

If you increase the number of draws to 1e6, the range of values that occur at least once widens:

matrices <- r2dtable(1e6, c(100, 100), c(100, 100))
vec.vals <- vapply(matrices, function(x) x[1, 1], numeric(1)); table(vec.vals)
vec.vals
    33     34     35     36     37     38     39     40     41     42     43     44     45 
     1      3      8     47    141    359    864   2148   4515   8946  15928  27013  41736 
    46     47     48     49     50     51     52     53     54     55     56     57     58 
 59558  78717  96153 108322 112524 107585  96042  78054  60019  41556  26848  16134   8627 
    59     60     61     62     63     64     65     66     68 
  4580   2092    933    351    138     42     11      4      1

... as predicted:

round(dhyper(0:100,100,100,100)*1e6)
  [1]      0      0      0      0      0      0      0      0      0      0      0      0
 [13]      0      0      0      0      0      0      0      0      0      0      0      0
 [25]      0      0      0      0      0      0      0      0      0      1      4     13
 [37]     43    129    355    897   2087   4469   8819  16045  26927  41700  59614  78694
 [49]  95943 108050 112416 108050  95943  78694  59614  41700  26927  16045   8819   4469
 [61]   2087    897    355    129     43     13      4      1      0      0      0      0
 [73]      0      0      0      0      0      0      0      0      0      0      0      0
 [85]      0      0      0      0      0      0      0      0      0      0      0      0
 [97]      0      0      0      0      0
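
The same reasoning can be turned into numbers without simulating anything. The sketch below uses the hypergeometric tail to estimate how often a value of 35 or less should turn up, and computes the theoretical standard deviation of the upper-left cell. The variable names are illustrative only.

```r
m <- 100  # first row total
n <- 100  # second row total
k <- 100  # first column total

# Probability that the upper-left cell is 35 or less in a single draw
p_low <- phyper(35, m, n, k)

# Expected number of such draws among 1e4 tables
expected_hits <- p_low * 1e4

# Probability of seeing at least one such draw among 1e4 tables
p_seen <- 1 - (1 - p_low)^1e4

# Theoretical standard deviation of the upper-left cell
N <- m + n
sd_x11 <- sqrt(k * (m / N) * (n / N) * (N - k) / (N - 1))

c(p_low = p_low, expected_hits = expected_hits,
  p_seen = p_seen, sd_x11 = sd_x11)
```

With these marginals the standard deviation is only a few units, so values far from 50 are rare by construction.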

score 1 · Answer 2 · answered May 18 '16 at 21:16

To get less concentrated matrices, you will have to find a balance between the number of columns / rows, totals and number of matrices. Consider the following sets:

m2rep <- r2dtable(1e4, rep(100,2), rep(100,2))
m2seq <- r2dtable(1e4, seq(50,100,50), seq(50,100,50))

which differ in the number of unique values:

> length(unique(unlist(m2rep)))
[1] 29
> length(unique(unlist(m2seq)))
[1] 58
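
One possible reading of this difference: with unequal marginals the cells are centred on different expected values, so pooling all cells yields more distinct values even though each individual cell may still be tightly concentrated. A minimal sketch, where `expected_cells` is a hypothetical helper and not part of R:

```r
# Expected cell counts under independence: row total * column total / grand total
expected_cells <- function(row_totals, col_totals) {
  outer(row_totals, col_totals) / sum(row_totals)
}

e_rep <- expected_cells(rep(100, 2), rep(100, 2))
e_seq <- expected_cells(seq(50, 100, 50), seq(50, 100, 50))

e_rep  # every cell has the same expected value
e_seq  # cells have different expected values
```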

plotting this with:

par(mfrow = c(1,2))
plot(table(unlist(m2rep)))
plot(table(unlist(m2seq)))

gives: [figure: frequency plots of the cell values of m2rep (left) and m2seq (right)]

Now consider:

m20rep <- r2dtable(1e4, rep(100,20), rep(100,20))
m20seq <- r2dtable(1e4, seq(50,1000,50), seq(50,1000,50))

which gives:

> length(unique(unlist(m20rep)))
[1] 20
> length(unique(unlist(m20seq)))
[1] 130

plotting this with:

par(mfrow = c(1,2))
plot(table(unlist(m20rep)))
plot(table(unlist(m20seq)))

gives: [figure: frequency plots of the cell values of m20rep (left) and m20seq (right)]

As you can see, playing with the parameters helps.

HTH

Jaap

I would like to keep the marginals as `c(100, 100), c(100, 100)`. I don't see how your solution achieves this. – paljenczy May 19 '16 at 07:33
@paljenczy My solution indeed doesn't achieve that. But because you didn't specify that requirement in your question, I couldn't know ;-) – Jaap May 19 '16 at 07:56
Any ideas how to achieve that? – paljenczy May 19 '16 at 10:05
@paljenczy I will try to look at it this weekend – Jaap May 20 '16 at 08:18