I have written two functions to pick a random element out of a list of unknown length. The first uses reservoir sampling (with a reservoir of size 1), and the second gets the length of the list, picks a random index, and returns the element at that index. For some reason, the former is much faster.
The first function uses a single traversal and picks the i-th element with probability 1/i, where i is the 1-based position of the element in the list. This results in an equal probability of picking each element.
import System.Random (RandomGen, newStdGen, randomR)

pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
    stdgen <- newStdGen
    return (pickRandom' xs x 1 stdgen)

-- Pick a random element using reservoir sampling (reservoir of size 1).
-- xi is the current pick; n is the number of elements seen so far.
pickRandom' :: (RandomGen g) => [a] -> a -> Int -> g -> a
pickRandom' [] xi _ _ = xi
pickRandom' (x:xs) xi n gen =
    let (rand, gen') = randomR (0, n) gen in
    if (rand == 0) then
        pickRandom' xs x (n + 1) gen'   -- Update value (probability 1/(n+1))
    else
        pickRandom' xs xi (n + 1) gen'  -- Keep previous value
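To double-check the equal-probability claim, here is a small sketch that computes the exact probability of each position ending up as the final pick. It is not part of the benchmarked code; `reservoirProbs` is a name introduced only for this illustration, and it uses exact `Rational` arithmetic instead of a random generator.

```haskell
import Data.Ratio ((%))

-- Exact probability that each position (1-based) is the final pick
-- of a size-1 reservoir over a list of n elements.
-- Position k replaces the current pick with probability 1/k, so the
-- probability of every earlier position is scaled by (1 - 1/k).
reservoirProbs :: Int -> [Rational]
reservoirProbs n = foldl step [] [1 .. n]
  where
    step probs k =
      let p = 1 % fromIntegral k
      in map (* (1 - p)) probs ++ [p]

main :: IO ()
main = print (reservoirProbs 5)
-- prints [1 % 5,1 % 5,1 % 5,1 % 5,1 % 5]
```

Every position comes out at 1/n, which matches what I expect from the replacement rule above.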
The second version traverses the list once to get its length, and then picks an index between 0 and the length of the input list minus 1 to get one of the elements, again with equal probability. The expected number of list traversals is 1.5:
-- Traverses the list twice: fully for length, then partially for (!!)
pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
    gen <- newStdGen
    let (e, _) = randomR (0, length xs - 1) gen
    return $ xs !! e
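For reference, this is how I arrive at the figure of 1.5 traversals, as a rough count of list cells visited. It assumes `length` visits all n cells and that `xs !! e` visits e + 1 cells for an index e drawn uniformly from 0 to n - 1:

```latex
\mathbb{E}[\text{cells visited}]
  = \underbrace{n}_{\texttt{length}}
  + \underbrace{\frac{1}{n}\sum_{e=0}^{n-1}(e+1)}_{\texttt{!!}}
  = n + \frac{n+1}{2}
  \approx 1.5\,n
```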
Here is the code I use for benchmarking these two functions:
import Criterion.Main (bench, defaultMain)

main :: IO ()
main = do
    gen <- newStdGen
    let size = 2097152
        inputList = getRandList gen size  -- getRandList: helper not shown here
    defaultMain [ bench "Using length" (pickRandomWithLen inputList)
                , bench "Using reservoir" (pickRandom inputList)
                ]
Here is the trimmed output:
benchmarking Using reservoir
mean: 82.72108 ns, lb 82.02459 ns, ub 83.61931 ns, ci 0.950
benchmarking Using length
mean: 17.12571 ms, lb 16.97026 ms, ub 17.37352 ms, ci 0.950
In other words, the first function is about 200,000 times faster than the second (roughly 83 ns vs. 17 ms). I expected the runtime to be influenced mainly by random number generation and the number of list traversals (1 vs. 1.5). What other factors can explain such a huge difference?