
How should I implement this code, assuming 10,000 points and 100,000 reps, so that the simulation is performed correctly?

using Statistics, Random, DataFrames, DataFramesMeta, CSV, PyPlot

# throw an ArgumentError unless 0 ≤ x ≤ 1
macro assertprob(x)
    msg = string("wrong ", x, ": ")
    :(0 ≤ $(esc(x)) ≤ 1 || throw(ArgumentError(string($msg, $(esc(x))))))
end

# number of draws until r < p (success), or missing if a draw
# lands in [p, p + q) first
function simulate(p::Float64, q::Float64)
    @assertprob p
    @assertprob q
    @assertprob p + q

    t = 0
    while true
        t += 1
        r = rand()
        r < p && return t
        r < p + q && return missing
    end
end

# rejection-sample (p, q) uniformly from the triangle p + q ≤ 1
function getpoint()
    while true
        p, q = rand(), rand()
        p + q ≤ 1 && return (p, q)
    end
end

# draw `points` pairs (p, q) and run `reps` simulations for each
function runsim(points=10^3, reps=10^3)
    df = DataFrame(p=Float64[], q=Float64[], rep=Int[],
                   sim=Union{Int,Missing}[])
    for i in 1:points
        p, q = getpoint()
        for j in 1:reps
            push!(df, (p, q, j, simulate(p, q)))
        end
    end
    df
end

# compare the simulated mean of t with the theoretical 1 / (p + q)
function analyzesim(df)
    @linq df |>
        by([:p, :q], msim=mean(collect(skipmissing(:sim)))) |>
        transform(mtheory=1 ./ (:p .+ :q)) |>
        with(scatter(:msim, :mtheory))
end

Random.seed!(1)
df = runsim()
CSV.write("results.txt", df)
analyzesim(df)

Does anybody have an idea? Thank you in advance for your help!

  • Welcome to Stack Overflow! "How to improve the code" is not very specific. Can you please [edit] your question to let us know what the expected result is and how your code fails to achieve it? [Reading this](https://stackoverflow.com/help/how-to-ask) may be useful – Klaus Gütter Feb 02 '19 at 09:26
  • For an increased size of the problem your current code would create a large `DataFrame`, so you can consider performing aggregation of the data in the inner loop and only store the aggregates needed in the output data frame. – Bogumił Kamiński Feb 02 '19 at 09:53
  • @JohnSnow If your code works correctly, but you are looking for improvements, consider posting this on [Code Review](https://codereview.stackexchange.com/). – phipsgabler Feb 02 '19 at 16:08

1 Answer


While I highly agree with the above comments (this forum is not the place for code reviews and your question is poorly worded), it is a rough life for the Snows.

I'll try to address the question "implement the (pseudo?) code [s]o that the simulation performed correctly", but since the code runs as-is, we can look at other possible issues. It all depends, however, on my understanding of what value you are trying to Monte Carlo estimate.

  1. simulate returns missing if p ≤ r < p + q. If that is an incorrect case, maybe repeat the experiment rather than returning missing, especially since you are dropping missings later anyway.
  2. simulate returns either missing or an Int. This is not type stable, and type instability is not preferred for performance.
  3. Similarly, the code dynamically grows df with push!. This is also a performance hit; the storage should be preallocated.
  4. I'm not sure a DataFrame is actually needed. The code can store p and q in two vectors of length npoints, and the results t from simulate in a reps × npoints matrix, then average along the first dimension (along the reps); see the sketch after this list.
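
For points 2–4, here is a minimal sketch of what I mean (runsim_flat and the averaging line are my own hypothetical names, not from the original code): p and q live in preallocated vectors, and the draws in a reps × npoints matrix.

# Sketch: preallocated storage instead of a dynamically grown DataFrame.
function runsim_flat(npoints=10^3, reps=10^3)
    ps   = Vector{Float64}(undef, npoints)
    qs   = Vector{Float64}(undef, npoints)
    sims = Matrix{Union{Int,Missing}}(undef, reps, npoints)  # reps × npoints
    for i in 1:npoints
        ps[i], qs[i] = getpoint()
        for j in 1:reps
            sims[j, i] = simulate(ps[i], qs[i])
        end
    end
    ps, qs, sims
end

# average along the first dimension (the reps), ignoring missings:
# msim = [mean(skipmissing(view(sims, :, i))) for i in 1:npoints]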

Finally, there is a lot of math that can be done to speed up performance. p and q are drawn uniformly from the first quadrant of the L1 norm ball, i.e. the triangle p, q ≥ 0, p + q ≤ 1. Said another way, the (unnormalized) density of a pair (p, q) is 1 if it is in that region and 0 otherwise.

A little math yields the marginal density and cumulative probability functions as f(p) = 2(1 - p) and F(p) = p(2 - p). (The math was done to normalize.) So you can draw a random p by first drawing u = rand() and solving u = 2p - p^2, which gives p = 1 - sqrt(1 - u), and then draw a random q = (1 - p) * rand().
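
In code, that direct draw could look like this (getpoint_direct is just my name for it; it replaces the rejection loop in getpoint):

# Draw (p, q) uniformly from the triangle without rejection:
# invert F(p) = 2p - p^2, i.e. p = 1 - sqrt(1 - u), then q | p ~ U[0, 1 - p].
function getpoint_direct()
    u = rand()
    p = 1 - sqrt(1 - u)
    q = (1 - p) * rand()
    (p, q)
end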

Finally, if I am understanding the simulation correctly, you are calculating the expected number of tries to get an r = rand() < p, but dropping the repetition if p ≤ r < p + q.

However, the code counts only how many tries it took to escape the [p + q, 1) region, not whether the final draw landed in the middle [p, p + q) region, because it drops all missings and never looks at how many of the repetitions came back missing (on average q/(p+q) * reps of them). This means the data are not missing at random but missing with probability q/(p+q), which will bias the estimate of t away from one of the corners of the (p, q) support.

The number of non-missing samples, and therefore the variance of the estimate of t, depends on q as well, introducing additional considerations for running a goodness-of-fit or p-value statistic on your data.
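
If you want to keep that information, one possibility (a sketch; simulate_counted is a name I made up) is to aggregate inside the inner loop, as Bogumił's comment also suggests, recording both the mean of the successful runs and the fraction that went missing:

# Per-point aggregation that keeps track of the missings.
function simulate_counted(p, q, reps)
    hits    = 0  # repetitions that ended with r < p
    misses  = 0  # repetitions absorbed by [p, p + q)
    total_t = 0  # tries summed over the successful repetitions
    for _ in 1:reps
        s = simulate(p, q)
        if s === missing
            misses += 1
        else
            hits += 1
            total_t += s
        end
    end
    (msim = total_t / hits, missfrac = misses / reps)  # missfrac ≈ q / (p + q)
end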

I'm actually quite glad I sat down with a pen and paper to think this through. The math to calculate the expected number of samples needed to land in [0, p) but not [p, p + q) (it comes out to p/(p+q)^2) was fun, and it took me a while to think through why msim was about 1/(p + q).
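
For reference, here is that pen-and-paper algebra written out (my own derivation of the geometric race, with per-trial success probability p and absorption probability q):

\[
\Pr(\text{success at trial } t) = (1-p-q)^{t-1}\,p,
\qquad
\Pr(\text{success}) = \sum_{t \ge 1} (1-p-q)^{t-1}\,p = \frac{p}{p+q},
\]
\[
\mathbb{E}\big[t\,\mathbf{1}_{\{\text{success}\}}\big]
  = \sum_{t \ge 1} t\,(1-p-q)^{t-1}\,p = \frac{p}{(p+q)^2},
\qquad
\mathbb{E}[t \mid \text{success}]
  = \frac{p/(p+q)^2}{p/(p+q)} = \frac{1}{p+q}.
\]

In particular, the mean over the non-missing repetitions targets exactly 1/(p + q), which is why msim lines up with mtheory in the scatter plot.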

I hope this helps!