
I have recently started playing around with Julia and I am currently working on a Monte Carlo simulation of some stochastic process on a 2-dimensional lattice. Each site has some associated rate of activation (the number of times it "does something" per second on average) which we can assume to be approximately constant. The order in which lattice sites activate is relevant so we need a method for randomly picking one particular site with probability proportional to its rate of activation.

It looks like sample(sites, weights(rates)) from the StatsBase package is exactly what I am looking for, BUT from testing out my code structure (no logic, just loops and RNG) it turns out that sample() scales linearly with the number of sites. This means that overall my runtimes scale like N^(2+2), where N is the side length of my 2-dimensional lattice (one factor of N^2 from the increase in the total rate of activity, the other from the scaling of sample()).
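For reference, the structure I have been timing looks roughly like the sketch below (placeholder rates, no actual lattice logic; the loop length and rate values are arbitrary):

using StatsBase

N = 100                      # side length of the lattice
sites = collect(1:N^2)       # linear indices of the lattice sites
rates = rand(N^2)            # placeholder activation rates

for step in 1:10^6
    site = sample(sites, weights(rates))   # pick a site with probability proportional to its rate
    # ... activate `site` and update the affected rates ...
end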

Now, the increase in the total rate of activity is unavoidable, but I think the scaling of the "random pick with weights" method can be improved. More specifically, one should be able to achieve logarithmic (rather than linear) scaling with the number of sites. Consider for example the following function (and, please, forgive the poor coding):

function randompick(indices, rates)
    # cumulative rates; note this is itself O(N^2) (see EDIT below)
    cumrates = [sum(rates[1:i]) for i in indices]
    pick = rand() * cumrates[end]
    tick = 0
    lowb = 0
    highb = indices[end]

    # binary search for the first index whose cumulative rate exceeds pick
    while tick == 0
        mid = floor(Int, (highb + lowb) / 2)
        midrate = cumrates[mid]

        if pick > midrate
            lowb = mid
        else
            highb = mid
        end

        if highb - lowb == 1
            tick = 1
        end
    end
    return highb
end

Because we halve the number of "pickable" sites at each step, it would take n steps to pick one specific site out of 2^n (hence the logarithmic scaling). However, in its current state randompick() is so much slower than sample() that the scaling is practically irrelevant. Is there any way of reducing this method to a form that can compete with sample() and hence take advantage of the improved scaling?

EDIT: calculating cumrates scales like N^2 as well but that can be solved by working with rates in the correct (cumulative) form throughout the code.
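For completeness, a standard way of keeping the rates in cumulative form while still being able to change a single site's rate cheaply is a Fenwick (binary indexed) tree. This is not something from my original code, just a sketch of a textbook data structure that seems to fit: both updating one rate and drawing one weighted sample cost O(log(number of sites)) instead of O(number of sites).

# Fenwick (binary indexed) tree holding the rates implicitly in cumulative form
struct Fenwick
    tree::Vector{Float64}
end

Fenwick(n::Integer) = Fenwick(zeros(n))

# add delta to the rate of site i
function update!(f::Fenwick, i::Integer, delta::Real)
    while i <= length(f.tree)
        f.tree[i] += delta
        i += i & (-i)
    end
end

# sum(rates[1:i]) in logarithmic time
function prefixsum(f::Fenwick, i::Integer)
    s = 0.0
    while i > 0
        s += f.tree[i]
        i -= i & (-i)
    end
    return s
end

# same binary search as randompick() above, but the cumulative rates
# are computed on demand instead of being precomputed
function samplesite(f::Fenwick)
    pick = rand() * prefixsum(f, length(f.tree))
    lowb, highb = 0, length(f.tree)
    while highb - lowb > 1
        mid = div(lowb + highb, 2)
        if pick > prefixsum(f, mid)
            lowb = mid
        else
            highb = mid
        end
    end
    return highb
end

The tree would be filled once with update!(f, i, rates[i]) for every site, a change of one rate becomes update!(f, site, newrate - oldrate), and samplesite(f) replaces the weighted sample() call.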

WSpine
  • Hi. I think that perhaps this question, or suggestion, is a better fit for the Julia discussion forum at https://discourse.julialang.org/latest. Stack Overflow is more suited for concrete questions, while Discourse is the place for ideas and discussions. You might also consider opening an issue on the StatsBase.jl GitHub page: https://github.com/JuliaStats/StatsBase.jl – DNF Aug 29 '18 at 18:28
  • Thanks for the suggestion. I guess I decided to post it here because I would have been happy with a solution specific to my case but I see that the core issue has little to do with the specific process I am simulating. In any case, for anyone interested, see my answer below for an efficient sampling method specific to my case that I found out about. – WSpine Aug 30 '18 at 14:31

2 Answers


A simpler version of what I think you were trying for is:

function randompick(rates)
    cumrates = cumsum(rates)
    pick = rand()*cumrates[end]
    searchsortedfirst(cumrates, pick)
end

The call to searchsortedfirst does scale logarithmically, but cumsum only scales linearly, thus eliminating any advantage this might have.

If the rates are constant, you could preprocess cumrates ahead of time, but if this was the case you would be better off using an alias table which can sample in constant time. There is an implementation available in the Distributions.jl package:

using Distributions

s = Distributions.AliasTable(rates)
rand(s)
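
In a simulation loop this would look something like the following sketch (the rates and loop length are placeholders); rand(s) returns a site index with probability proportional to its rate:

using Distributions

rates = rand(100^2)                    # placeholder rates for a 100x100 lattice
s = Distributions.AliasTable(rates)    # O(number of sites) preprocessing

for step in 1:10^6
    site = rand(s)                     # constant-time draw of a site index
    # ... activate `site` ...
end

Since the table is built from the rates up front, it would have to be rebuilt whenever they change, which is why this only pays off when the rates stay fixed.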
Simon Byrne
  • I realised a moment after posting the question that calculating the cumulative rates at each call would have set the scaling back to linear (hence my EDIT). I guess one would need to work with cumulative rates throughout to make this approach useful. Regarding your second comment, the rates are not constant (i.e. the rate associated with a specific site can change after some activation event at the site or in its neighbourhood) but can only take on a few pre-determined values. – WSpine Aug 30 '18 at 14:40
  • In that case, you may be able to exploit the structure of the problem, but we would need more information about how the rates can change. – Simon Byrne Aug 30 '18 at 16:27

I found out about an alternative sampling method in this paper by P. Hanusse that does not seem to scale with N, at least when the allowed activity rates are of the same order of magnitude.

The idea is to assume that all sites have the same rate of activity, equal to the rate of activity of the most active site, maxrate (so that the random pick is reduced to a single RNG call rand(1:N^2)). Once we have picked a site, we separate its (constant) rate of activity into two contributions: its original rate of activity and a "do-nothing" rate (the second being the constant rate minus the original rate). Now we generate a second random number c = rand() * maxrate. If c < rates[site], we keep that site choice and proceed to activate the site; otherwise we go back to the uniform random pick.

The function containing the two RNG calls would look like this, with the second returned value determining whether the call has to be repeated.

function HanussePick(rates, maxrate)
    site = rand(1:N^2)                   # uniform pick among the N^2 lattice sites
    slider = rand() * maxrate            # uniform threshold in [0, maxrate)
    return site, rates[site] - slider    # positive => keep the pick, otherwise "do nothing" and retry
end
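
A minimal retry wrapper might look like this (a sketch; the wrapper name is mine, and rates, maxrate and N are assumed to be defined as above):

function pickactivesite(rates, maxrate)
    while true
        site, keep = HanussePick(rates, maxrate)
        keep > 0 && return site   # accept; a non-positive value is a "do-nothing" event, so retry
    end
end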

The advantage of this approach is that, if the allowed rates of activity are comparable to each other, there should be no scaling with N: a pick is kept with probability rates[site]/maxrate, so the expected number of attempts per activation is roughly maxrate divided by the mean rate, independent of the number of sites, and we only need to generate O(1) random numbers per event.

WSpine