
With R I want to find the probability that the Age vector below resulted from random sampling. I used the runs test (from the randtests package), which resulted in p-value = 0.2892. Other colleagues used the rle function (run length encoding in R), among other tools, to simulate the probability that random allocation would generate the observed sequences. Their result shows p < 0.00000001 that this sequence is the result of random sampling. I am trying to find R code to replicate their findings; any help on how to set up such a simulation is highly appreciated.

Update: I received advice from a statistician that I can do this using a non-parametric bootstrap. However, I still do not know how this can be done (a sketch of one possible approach appears after the example below). I appreciate your help.

example:

Age <- c(68, 71, 72, 69, 80, 78, 80, 81, 84, 82, 67, 73, 65, 68, 66, 70, 69, 72,
         74, 73, 68, 75, 70, 72, 75, 73, 69, 75, 74, 79, 80, 78, 80, 81, 79, 82,
         69, 73, 67, 66, 70, 72, 69, 72, 75, 80, 68, 69, 71, 77, 70, 73)

randtests::runs.test(Age)   # Wald-Wolfowitz runs test of randomness, p-value = 0.2892
X <- rle(Age)               # run length encoding of the sequence
X$lengths                   # lengths of runs of identical adjacent values
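
One possible resampling check along these lines (a sketch under the assumption that the statistic of interest is the number of runs of identical adjacent values; this is a guess, not the colleagues' actual code):

# Permutation sketch (assumption): shuffle the Age values many times and see
# how often a random ordering produces as few runs as the observed ordering.
set.seed(1)
obs_runs <- length(rle(Age)$lengths)                          # runs in observed order
sim_runs <- replicate(1e5, length(rle(sample(Age))$lengths))
mean(sim_runs <= obs_runs)                                    # P(as few or fewer runs)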
  • `p < 0.00000001` seems absurd for such a small sample. In any event -- p-values are relative to some hypothesis. Without knowing exactly what their hypothesis was, it is difficult to replicate their p-value. Perhaps you can ask your colleagues to clarify how they computed their p-value. That would make more sense than asking random strangers on the internet who don't know the context. – John Coleman Nov 21 '21 at 15:06
  • Their hypothesis is that this sequence of data did not follow a random distribution, i.e. was not the result of chance. They compared this sequence to an expected random sequence generated by simulation and concluded the mentioned probability of non-randomness. I would need to replicate it myself but do not know the code. Any help is highly appreciated. – Mohamed Fawzy Nov 21 '21 at 17:41
  • Random sampling *of what*? The numbers don't look like a random sampling of integers in the range 65 to 84. There seems to be some context missing. In any event, the onus is on people who make a statistical claim to clearly state what that claim is and to provide enough information for others to replicate their results. If you are trying to guess what is behind your colleagues' conclusions, then they have failed in the task of being good communicators. Why not ask them to clarify? – John Coleman Nov 21 '21 at 18:30
  • They stated that they used resampling to compare the mentioned sequence to an expected random sequence created by simulation. In their paper, they did not provide the code or any further details beyond this sentence, which I copy from their paper: “I also used resampling to calculate the probability of runs of the same value in columns”. They failed to respond to those who asked them directly. The only option is to seek help from R experts to try to guess the code. – Mohamed Fawzy Nov 21 '21 at 18:42
  • If this comes from a published paper, perhaps you can edit the question to include a citation. – John Coleman Nov 21 '21 at 18:44
  • Thank you for your comments. I just need the code to try to replicate their result, with no aim to criticize or open a discussion regarding the publication. – Mohamed Fawzy Nov 21 '21 at 19:22
  • You are the one that mentioned a paper. If such a paper exists and is available then it might shed some light on what they were doing. Also -- you never answered the question I asked about what is being sampled. The numbers are manifestly not integers which are randomly chosen uniformly from their range. What exactly is the random hypothesis in this case? – John Coleman Nov 22 '21 at 00:48
  • This is the DOI for the paper (https://doi.org/10.1111/anae.15263). It is related to the integrity of clinical trials. The variable mentioned is the age variable of example 14 (Trial 453) in the supplement of the paper. The authors claim that, by resampling, they can compare the randomness of sequences in the variable against expected random values generated by simulation. I very much appreciate your help verifying what they did. – Mohamed Fawzy Nov 22 '21 at 02:32

1 Answer


What was initially presented isn't the whole story. If one looks at the supplement where these numbers are from, the reported p-value is for comparing two vectors. OP only provides one, and hence the task is not well-defined.

The full assertion of the research article is that

group1 <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73)
group2 <- c(69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)

being two independent random samples has a p-value < 0.00000001.

Even just checking positional identity (10 entries match by position in the original) with permutations within a group, I'm seeing only 2 or 3 draws per million that have a similar number of identical values. I.e., something like:

set.seed(123)
# How often does a random permutation of group1 agree with group2 in at least
# as many positions (10) as the original pairing does?
mean(replicate(1e6, sum(sample(group1, length(group1)) == group2)) >= 10)
# 2e-06

Testing correlations and/or bootstrapping could easily land in the reported p-value range (nothing as extreme in 100 million simulations).
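
For illustration, a hedged sketch of the "testing correlations" idea could be a permutation test of the positional correlation between the two groups; the statistic and the number of draws here are my own choices, not necessarily what the paper used:

# Permutation sketch (assumption): how often does shuffling group1 produce a
# correlation with group2 at least as large as the one actually observed?
set.seed(123)
obs_cor <- cor(group1, group2)
sim_cor <- replicate(1e6, cor(sample(group1), group2))
mean(sim_cor >= obs_cor)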

merv
  • Merv, thank you so much. I believe the two groups were not compared to one another. They used non-parametric bootstrapping (resampling with replacement) to generate expected random variables of the same length. Then they compared the existing variable's run length encoding (via the rle function) to the generated variables' run length encodings to determine how probable the observed runs are under random resampling. I would appreciate your help solving this issue for me. – Mohamed Fawzy Nov 23 '21 at 14:55
  • I used one group to make the code short, but the two groups are merged into one variable to be used for the resampling and comparison (a sketch of that comparison follows these comments). group1 <-c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73,69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73) – Mohamed Fawzy Nov 23 '21 at 14:55
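
A minimal sketch of the rle-based bootstrap described in these comments, assuming the statistic compared is the longest run of identical values and using the 52-value Age vector from the question (the same values as the merged vector above); the authors' actual statistic and procedure may differ:

# Bootstrap sketch (assumption, not the authors' code): resample the merged
# vector with replacement to build an expected distribution of the longest run
# of identical values, then compare the observed longest run against it.
set.seed(123)
obs_longest <- max(rle(Age)$lengths)                          # longest observed run
sim_longest <- replicate(1e6, max(rle(sample(Age, replace = TRUE))$lengths))
mean(sim_longest >= obs_longest)                              # P(a run at least as long)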