Creating a contingency table by hypergeometric sampling with the Titanic's database

Question

I created a contingency table from the Titanic passenger data by hypergeometric sampling, meaning that both sets of marginal totals are preset and equal. It crosses the Sex and Survived columns for 328 cases (164 men and 164 women). This is the code:

First, I ungrouped the data and dropped the columns I don't need:

titanic = as.data.frame(Titanic)
titanic = titanic[rep(1:nrow(titanic),titanic$Freq),]
titanic = titanic[,c(2,4)]

Then I selected a sample of men:

men = subset(titanic, titanic$Sex == 'Male')
men = men [sample(nrow(men),164), ]
table(men$Sex, men$Survived)

#           No Yes
#   Male   133  31
#   Female   0   0

Now the row of women must be filled in with the appropriate values:

n = summary.factor(men$Survived)
womenYes = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='Yes'))
womenYes = subset(womenYes[1:n[1], ])
womenNo = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='No'))
womenNo = subset(womenNo[1:n[2], ])
women = merge(womenYes, womenNo, all = TRUE)
hyperSample = merge(men, women, all = TRUE)
table(hyperSample$Sex, hyperSample$Survived)

#           No Yes
#   Male   133  31
#   Female  31 133

It works, but it looks a bit ugly, and I honestly think someone could find a much more elegant or efficient way to do it. Thanks.

Thanks for the quick response @alistaire, but no. It must be done by hypergeometric sampling: the marginal totals are preset and equal, with 328 cases. – Ángel Jul 21 '18 at 22:00
@Ángel I'm pretty sure your "answer" is flawed. If you sample under the condition of 328 total cases, then you should not force the male survivors to equal the number of female decedents. – IRTFM Jul 24 '18 at 00:11

IRTFM · Answer 1 · 2018-07-24T16:41:36.287

You can sample in two stages, both using rhyper. First, determine the number of men and women, subject only to a total sample of 328 and assuming the population is sex-distributed as in the original data. This is what you might do if you were trying to bootstrap a statistic like a rate ratio. Second, use rhyper twice more to determine the number of survivors in each row, subject to the survival probabilities in the corresponding rows of the original data.

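# Sex x Survived counts, summed over Class and Age
MFmat <- apply(Titanic, c(2, 4), sum)
# Stage 1: number of men in a sample of 328
nMale <- rhyper(1, rowSums(MFmat)[1], rowSums(MFmat)[2], 328)
nMale
#[1] 262   (varies per run; the table below is from a different draw)
nFemale <- 328 - nMale
# Stage 2: deaths (Survived == "No") within each sex
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale <- nMale - DMale
DFemale <- rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale <- nFemale - DFemale
matrix(c(DMale, DFemale, SurvMale, SurvFemale), ncol = 2,
       dimnames = dimnames(MFmat))
#----
#         Survived
# Sex       No Yes
#   Male   223  42
#   Female  22  41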
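
To use the two-stage draw for something like the rate ratio mentioned above, it can be wrapped in a function and repeated. This is only a minimal sketch: `draw_table` and `rate_ratio` are hypothetical helper names, the 1000 replicates and the seed are arbitrary, and the ratio shown (female survival rate over male survival rate) is just one possible statistic.

```r
# Sketch only: draw_table() and rate_ratio() are hypothetical helpers.
MFmat <- apply(Titanic, c(2, 4), sum)   # Sex x Survived counts

draw_table <- function(counts, size = 328) {
  # Stage 1: how many of the `size` sampled people are male
  n_male   <- rhyper(1, sum(counts[1, ]), sum(counts[2, ]), size)
  n_female <- size - n_male
  # Stage 2: deaths within each sex, drawn from that sex's own row
  d_male   <- rhyper(1, counts[1, 1], counts[1, 2], n_male)
  d_female <- rhyper(1, counts[2, 1], counts[2, 2], n_female)
  matrix(c(d_male, d_female, n_male - d_male, n_female - d_female),
         ncol = 2, dimnames = dimnames(counts))
}

rate_ratio <- function(tab) {
  # female survival rate divided by male survival rate
  (tab["Female", "Yes"] / sum(tab["Female", ])) /
    (tab["Male", "Yes"] / sum(tab["Male", ]))
}

set.seed(1)                              # arbitrary seed, for repeatability
ratios <- replicate(1000, rate_ratio(draw_table(MFmat)))
quantile(ratios, c(0.025, 0.5, 0.975))   # spread of the resampled ratio
```

Every table produced this way has 328 cases, but neither the row nor the column totals are fixed.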

I suppose you could instead sample the two rows separately, using the same logic as above, if that is what you have decided to do. Which way is more appropriate will depend on the underlying problem.

# Fixed row marginals
nMale <- 164
nFemale <- 164
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale <- nMale - DMale
DFemale <- rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale <- nFemale - DFemale
matrix(c(DMale, DFemale, SurvMale, SurvFemale), ncol = 2,
       dimnames = dimnames(MFmat))
#----------------
#         Survived
# Sex       No Yes
#   Male   127  37
#   Female  39 125
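
If both sets of marginal totals really must be fixed in advance, a 2x2 table has only one free cell, so a single rhyper call is enough. The following is a minimal sketch under stated assumptions: the column totals of 164/164 are taken from the output shown in the question, and the draw is made under independence of Sex and Survived rather than from the Titanic rows, so it is not the same distribution as the question's procedure. Base R's r2dtable() produces the same kind of table for general margins.

```r
row_tot <- c(Male = 164, Female = 164)   # preset Sex margin
col_tot <- c(No = 164, Yes = 164)        # preset Survived margin (assumed)

# With both margins fixed, the Male/No cell is hypergeometric:
# draw row_tot["Male"] people from col_tot["No"] deaths and
# col_tot["Yes"] survivors, and count the deaths.
male_no <- rhyper(1, col_tot[["No"]], col_tot[["Yes"]], row_tot[["Male"]])

# The other three cells follow from the margins (filled column by column).
tab <- matrix(c(male_no,
                col_tot[["No"]] - male_no,
                row_tot[["Male"]] - male_no,
                row_tot[["Female"]] - col_tot[["No"]] + male_no),
              ncol = 2,
              dimnames = list(Sex = names(row_tot), Survived = names(col_tot)))
tab

# Same idea for general margins, using stats::r2dtable()
tab2 <- r2dtable(1, row_tot, col_tot)[[1]]
tab2
```

With equal margins the result is always symmetric, as in the question, but here the free cell is drawn at random instead of being copied from a sample of men.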

edited Jul 24 '18 at 16:41

answered Jul 24 '18 at 00:28

IRTFM

Let me explain a little more over several comments. Poisson sampling is when n is not preset: for example, interviewing people for an hour, without knowing how many there will be. Multinomial sampling is the same, but n is now preset. In the same example, we phone 20, 50 or 1000 people; the number doesn't matter, the important thing is that the sample size is set. – Ángel Jul 24 '18 at 09:20
Independent multinomial sampling: now the marginal totals of one factor are preset too. For example, we conduct a study with 140 patients on placebo and 139 patients on a flu vaccine. The totals are already preset. – Ángel Jul 24 '18 at 09:20
Hypergeometric sampling: the marginal totals of ALL the factors are preset. In the experiment Fisher described in 1935, a woman claimed she could tell whether milk had been poured into the cup before the tea. In 4 cups the tea had been poured first and in the other 4 the milk had been poured first, so the marginals were set before the experiment. As you can see, the size n is not relevant to the sampling; its value is purely arbitrary. What matters in this case is that the marginal totals are preset. I hope this explanation helps you understand me better. – Ángel Jul 24 '18 at 09:24
For that reason, my code is correct. In your code the marginal totals are not preset, and they must all be the same in every factor. It's wrong even though the margins do add up to the total (265+63 and 245+83 both equal 328). – Ángel Jul 24 '18 at 09:35
Oh, I forgot: hyperSample is not some esoteric function. It's just the name of the data frame we are looking for. – Ángel Jul 24 '18 at 15:12
As I said, the code can be revised to accommodate fixed numbers of men and women, but I didn't see why one would do such a thing, and I don't see an explanation yet. Since you didn't appear to understand, I've put in the code that I thought was obvious. At the moment all you are doing is picking the first few rows of each group, and that is NOT any sort of random sampling. – IRTFM Jul 24 '18 at 16:36
I just explained it to you in several comments, with typical examples from the statistical literature. If you've never heard of this type of sampling, it's not my fault, and I'd appreciate it if you at least didn't penalize my question for reasons like "you don't understand / it's wrong", especially when my code (not yours) is correct and free of statistical errors. I'm a statistician, and for the third time, the marginals of the two factors **must be the same**: 127+39 != 127+37, so once again you haven't fixed anything. The question is about writing cleaner code, not about statistical knowledge. – Ángel Jul 24 '18 at 17:52
The single example you cited, Fisher's classic example, was a designed experiment. You did not present a designed experiment, but rather an observational study, so the statistical issues are different. The downvoters are not me. And I don't see how you can avoid severe criticism when you are not doing any random sampling in your code despite requesting same. – IRTFM Jul 24 '18 at 21:16
_you are not doing any random sampling in your code despite requesting same._ 6th line -> men = men [sample(nrow(men),164), ] So the values are different every time you run the code. Since the two factors are dichotomous (male/female, survives/dies), you can fill in the rest of the data by conditioning on the men's sample so that the marginal totals are fixed, and the whole experiment is random sampling, more specifically hypergeometric sampling. – Ángel Jul 24 '18 at 21:40
Your code is doing what you are telling it to, namely forcing the number of Female:Survived=no to be equal to 164-Male:Survived=No. You are forcing the resultant matrix to be symmetric. I fail to see how that can be called random sampling. I think you have placed too many constraints on your problem. Perhaps you should consult your academic advisor on this concern? – IRTFM Jul 24 '18 at 21:50
_Perhaps you should consult your academic advisor on this concern?_ Perhaps that was done at the time, and that's why it's correct. If, from a sample of 328 people, we preselect that half will be men and half will be survivors, then obviously the rest will be women (164). And if, for example, only 25 men survived out of a total of 164, it's because 139 women also survived, and that is why we also know how many women died: 25 again. Why preset the totals? Because it's HYPERGEOMETRIC SAMPLING. I think it is useless to continue debating this. The reasoning is correct, although you do not believe it. – Ángel Jul 24 '18 at 22:40
Let the record show I was the one who illustrated R’s random sampling function based on the hypergeometric distribution. If both you and your advisor choose to use it in a particular manner, it’s your reputations that are at stake. – IRTFM Jul 25 '18 at 01:06
Drawing a random sample from a hypergeometric distribution is one thing (for example, using rhyper, and you'd be totally right about that, of course). Building a contingency table in the specific way called "hypergeometric sampling" is a very different thing: every expected cell frequency comes from a multivariate hypergeometric distribution, and the table is obtained by setting the marginal totals of the two factors before sampling. The totals of the factors may or may not be the same (it doesn't matter), but they must be fixed in advance. You are confusing the two, and I'm not going to explain it one more time. – Ángel Jul 25 '18 at 12:18
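
The four sampling schemes named in the comments differ only in which totals are held fixed, which can be illustrated with base R random generators. This is a minimal sketch, assuming made-up cell probabilities `p` (illustrative only, not Titanic estimates) and, for the last scheme, column totals of 164 each:

```r
set.seed(42)   # arbitrary seed, for repeatability
# Assumed cell probabilities, for illustration only
p <- matrix(c(0.4, 0.1, 0.1, 0.4), 2,
            dimnames = list(Sex = c("Male", "Female"),
                            Survived = c("No", "Yes")))

# Poisson sampling: nothing is fixed, even the grand total is random
pois_tab <- matrix(rpois(4, lambda = 328 * p), 2, dimnames = dimnames(p))

# Multinomial sampling: only the grand total (328) is fixed
multi_tab <- matrix(rmultinom(1, size = 328, prob = p), 2,
                    dimnames = dimnames(p))

# Independent multinomial sampling: row totals fixed at 164 each
prod_tab <- t(sapply(1:2, function(i) rmultinom(1, 164, p[i, ])))
dimnames(prod_tab) <- dimnames(p)

# Hypergeometric sampling: row and column totals both fixed
hyper_tab <- r2dtable(1, c(164, 164), c(164, 164))[[1]]
dimnames(hyper_tab) <- dimnames(p)

# Compare which margins stay constant across the four tables
lapply(list(poisson = pois_tab, multinomial = multi_tab,
            indep_multinomial = prod_tab, hypergeometric = hyper_tab),
       addmargins)
```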
