While I highly agree with the above comments (this forum is not the place for code reviews and your question is poorly worded), it is a rough life for the Snows.
I'll try to address the question "implement the (pseudo?) code [s]o that the simulation performed correctly", but since the code runs, we can look at other possible issues. However, it all depends on my understanding of what value you are trying to Monte Carlo simulate:
returns a missing
if p > r < p+q < 1
. If that is an incorrect case, maybe repeat the experiment rather returning a missing. Especially since you are dropping missings.
- `simulate` returns either a `missing` or a `Float64`. This is not type stable, and is not preferred.
- Similarly, the code dynamically grows `df`. This is also a performance hit, and `df` should be pre-allocated.
- I'm not sure a DataFrame is actually needed. The code can store `p` and `q` in two vectors of length `npoints`, store the results `t` from `simulate` in a `reps`-by-`npoints` matrix, and then average along the first dimension (over the `reps`); see the sketch right after this list.
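Here is a minimal sketch of that layout, under my reading of the experiment (the `simulate` body is my guess at the intent: count `rand()` draws until one falls below `p + q`, returning the count only if it fell below `p`; `npoints` and `reps` are placeholder sizes):

```julia
using Statistics

# My guess at the intended experiment: count rand() draws until one lands
# below p + q; report the count if it landed below p, otherwise missing.
function simulate(p, q)
    tries = 0
    while true
        tries += 1
        r = rand()
        r < p     && return tries     # landed in [0, p): success, report the count
        r < p + q && return missing   # landed in [p, p+q): the dropped case
    end
end

npoints, reps = 1_000, 10_000

# (p, q) uniform on the triangle {p ≥ 0, q ≥ 0, p + q ≤ 1}, here by rejection;
# a direct inverse-CDF draw is sketched after the next paragraph.
p = Vector{Float64}(undef, npoints)
q = Vector{Float64}(undef, npoints)
for i in 1:npoints
    while true
        p[i], q[i] = rand(), rand()
        p[i] + q[i] <= 1 && break
    end
end

# pre-allocated results: reps-by-npoints, no dynamically grown DataFrame
t = Matrix{Union{Missing,Int}}(undef, reps, npoints)
for j in 1:npoints, i in 1:reps
    t[i, j] = simulate(p[j], q[j])
end

# average along the first dimension (over the reps), skipping the missings
tbar = [mean(skipmissing(view(t, :, j))) for j in 1:npoints]
```

Nothing here is dynamically resized, and every container has a concrete element type (apart from the `Union{Missing,Int}`, which goes away if you repeat the experiment instead of returning `missing`).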
Finally, there is a lot of math that can be done to speed up performance. `p` and `q` are drawn uniformly from the first quadrant of the L1 norm ball; said another way, the (unnormalized) density of a pair `(p, q)` is 1 inside that region and 0 outside. A little math yields the marginal density and cumulative probability functions `f(p) = 2(1 - p)` and `F(p) = p*(2 - p)` (the math was done to normalize). So you can draw a random `p` by first drawing `u = rand()` and solving `u = 2p - p^2` for `p`, then draw a random `q = (1 - p)*rand()`.
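In code, that draw might look something like this (solving `u = 2p - p^2` for `p` in `[0, 1]` gives `p = 1 - sqrt(1 - u)`):

```julia
# Draw (p, q) uniformly on the triangle {p ≥ 0, q ≥ 0, p + q ≤ 1} without rejection.
function draw_pq()
    u = rand()
    p = 1 - sqrt(1 - u)    # inverse of F(p) = p * (2 - p)
    q = (1 - p) * rand()   # given p, q is uniform on [0, 1 - p]
    return p, q
end
```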
Lastly, if I am understanding the simulation correctly, you are calculating the expected number of tries to get an `r = rand() < p`, but dropping the repetition if `p < r < p + q`. However, the code counts only how many tries it took to escape the `[p+q, 1]` region, not whether the final sample was in the middle `(p, p+q)` region, because it drops all `missing`s and never looks at how many of the repetitions were equal to `missing` (on average `q/(p+q)*reps`). This means the data are not missing at random but instead missing with probability `q/(p+q)`, which will bias the estimate of `t` away from one of the corners of the `(p, q)` support.
Also, the number of non-missing samples, and therefore the variance of the estimate `t`, depends on `q` as well, introducing additional considerations for running a goodness-of-fit or p-value statistic on your data.
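One way to see this, reusing the names from the sketch above, is to track the fraction of `missing` repetitions per point and compare it with `q/(p+q)`:

```julia
# Fraction of missing repetitions for each (p, q), compared with the
# predicted missingness probability q / (p + q).
miss_frac = [count(ismissing, view(t, :, j)) / reps for j in 1:npoints]
pred_frac = q ./ (p .+ q)
maximum(abs.(miss_frac .- pred_frac))   # should be small for large reps
```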
I'm actually quite glad I sat down with a pen and paper to think this through. The math to calculate the expected number of samples needed to land in `[0, p]` but not `[p, p+q]` (`p/(p+q)^2`) was fun, and it took me a while to think through why `msim` was about `1/(p+q)`.
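For what it's worth, here is a quick numerical check of those two quantities against the `simulate` sketch above (again, my reading of the setup rather than necessarily yours):

```julia
# Numerical check of the pen-and-paper results for one (p, q) pair,
# reusing the simulate sketch from above.
p0, q0 = 0.2, 0.3
draws  = [simulate(p0, q0) for _ in 1:10^6]

# E[tries * (final draw landed in [0, p])] should be p / (p + q)^2
mean(coalesce.(draws, 0))      # ≈ 0.2 / 0.5^2 = 0.8

# E[tries | final draw landed in [0, p]] should be 1 / (p + q),
# which is what the missings-dropped average ends up estimating
mean(skipmissing(draws))       # ≈ 1 / 0.5 = 2.0
```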
I hope this helps!