Given a (strict) ByteString, what is the most efficient way to count how many of each possible byte it contains?

I see that sort is supposed to be implemented as a counting sort - but there doesn't appear to be a way to access the associated counts. I also see that there's a count function, which counts the number of times a given byte appears. This gives me the following options:

  • map (\ b -> count b str) [0x00 .. 0xFF]
  • map length . group . sort
  • Something with fold* and an IntMap of byte frequencies.

Which is likely to give me the best performance?
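
For concreteness, here is how the three options could look as code (a sketch of my own; note that the group/sort version omits byte values that never occur, unlike the other two):

```haskell
import qualified Data.ByteString as BS
import qualified Data.IntMap.Strict as IM
import Data.List (group)

-- Option 1: one pass over the ByteString per byte value (256 passes).
countEach :: BS.ByteString -> [Int]
countEach str = map (\b -> BS.count b str) [0x00 .. 0xFF]

-- Option 2: counting sort, then measure the run lengths.
-- Caveat: byte values that are absent produce no entry at all.
countGroup :: BS.ByteString -> [Int]
countGroup = map length . group . BS.unpack . BS.sort

-- Option 3: a strict left fold accumulating frequencies in an IntMap.
countFold :: BS.ByteString -> IM.IntMap Int
countFold = BS.foldl' (\m b -> IM.insertWith (+) (fromIntegral b) 1 m) IM.empty
```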

MathematicalOrchid

4 Answers


The basic idea of dflemstr is of course right, but since you want the best performance, you need to use unchecked access to the ByteString as well as to the counting array, like

import Control.Monad.ST
import Data.Array.ST
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.Unboxed

import Data.Word

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.ByteString.Unsafe

histogram :: ByteString -> UArray Word8 Int
histogram bs = runSTUArray $ do
    hist <- newArray (0, 255) 0
    let l = BS.length bs
        update b = do
            o <- unsafeRead hist b
            unsafeWrite hist b (o+1)
        loop i
            | i < 0     = return hist
            | otherwise = do
                update $ fromIntegral (bs `unsafeIndex` i)
                loop (i-1)
    loop (l-1)

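For comparison, the same histogram can also be written safely and purely with accumArray (my own sketch; it keeps the bounds checks, so it is slower than the unsafe ST loop, but it makes a handy reference implementation to test against):

```haskell
import qualified Data.ByteString as BS
import Data.Array.Unboxed
import Data.Word

-- Pure, bounds-checked histogram: every byte contributes a (index, 1)
-- pair, and accumArray sums them into the 256-slot array.
histogramPure :: BS.ByteString -> UArray Word8 Int
histogramPure bs = accumArray (+) 0 (0, 255) [ (b, 1) | b <- BS.unpack bs ]
```
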
That makes a considerable difference, according to criterion (building a histogram of a 200,000-byte ByteString):

warming up
estimating clock resolution...
mean is 1.667687 us (320001 iterations)
found 3078 outliers among 319999 samples (1.0%)
  1947 (0.6%) high severe
estimating cost of a clock call...
mean is 40.43765 ns (14 iterations)

benchmarking dflemstr
mean: 21.42852 ms, lb 21.05213 ms, ub 21.77954 ms, ci 0.950
std dev: 1.873897 ms, lb 1.719565 ms, ub 2.038779 ms, ci 0.950
variance introduced by outliers: 74.820%
variance is severely inflated by outliers

benchmarking unsafeIndex
mean: 312.6447 us, lb 304.3425 us, ub 321.0795 us, ci 0.950
std dev: 42.86886 us, lb 39.64363 us, ub 46.52899 us, ci 0.950
variance introduced by outliers: 88.342%
variance is severely inflated by outliers

(I changed dflemstr's code to also use runSTUArray and return a UArray Word8 Int, to have uniform return values; that doesn't make a big difference in running time, though.)

Daniel Fischer
  • OOC, how does `count` compare? – MathematicalOrchid Jun 15 '13 at 15:43
  • Wait a minute or two, haven't tested yet. – Daniel Fischer Jun 15 '13 at 15:44
  • `mean: 41.18654 ms, lb 41.05191 ms, ub 41.39745 ms, ci 0.950` for `chistogram bs = array (0,255) [(b, B.count b bs) | b <- [0 .. 255]]`. I can imagine one could speed that up, but since it has to do 256 passes over the `ByteString`, that can't do too well. – Daniel Fischer Jun 15 '13 at 15:49
  • I understand, but I feel some "Haskell disappointed" when I see `unsafe`... :/ – josejuan Jun 15 '13 at 18:51
  • 2
    @josejuan In this case, it's only an unfortunate choice of name, it should have been `unsafeIfYouDidn'tCheckTheValidityOfIndexButPerfectlyFineIfYouDidRead/Write/Index`, but that is too long. It's an entirely different kind of "unsafe" than `unsafePerformIO`. – Daniel Fischer Jun 15 '13 at 18:55
  • Thanks a lot @DanielFischer for your reply, I'm a (4ever) rookie and must investigate it (but it doesn't smell right :D :D) – josejuan Jun 15 '13 at 19:00

The most efficient method probably involves using a mutable array for storing the counts. This is potentially one of the most efficient O(n) solutions available:

import Control.Monad
import Control.Monad.ST

import Data.Array.ST

import Data.ByteString (ByteString)
import qualified Data.ByteString as ByteString

import Data.Word

byteHistogram :: ByteString -> [Int]
byteHistogram bs = runST $ do
  histogram <- newArray (minBound, maxBound) 0 :: ST s (STUArray s Word8 Int)
  forM_ (ByteString.unpack bs) $ \ byte ->
    readArray histogram byte >>= return . (+1) >>= writeArray histogram byte
  getElems histogram
dflemstr

Well, you can guess, or you can write a program and measure it - the results can surprise you.

import Data.ByteString as B
import Data.IntMap.Strict as I
import qualified Data.Vector.Unboxed.Mutable as M
import Data.Vector.Unboxed as V
import Criterion
import Criterion.Main
import System.Entropy
import System.IO.Unsafe
import Data.Word

main = do
    bs <- getEntropy 1024
    defaultMain [ bench "map count" $ nf mapCount bs
                , bench "map group sort" $ nf mapGroup bs
                , bench "fold counters" $ nf mapFoldCtr bs
                , bench "vCount" $ nf vectorCount bs
                ]

-- O(n*m) (length of bytestring, length of list of element being counted up)
-- My guess: bad
mapCount :: ByteString -> [Int]
mapCount bs = Prelude.map (`B.count` bs) [0x00..0xFF]

-- Notice that B.sort uses counting sort, so there's already lots of
-- duplicate work done here.
-- The big-O matters less than the use of lists: allocation and large
-- constant factors likely dominate.
mapGroup :: ByteString -> [Int]
mapGroup = Prelude.map Prelude.length . Prelude.map B.unpack . B.group . B.sort

mapFoldCtr :: ByteString -> [Int]
mapFoldCtr bs  = I.elems $ B.foldl' cnt I.empty bs
 where
 cnt :: I.IntMap Int -> Word8 -> I.IntMap Int
 cnt m k = I.insertWith (+) (fromIntegral k) 1 m

vectorCount :: B.ByteString -> [Int]
vectorCount bs = V.toList $ V.create $ do
        v <- M.new 256
        Prelude.mapM_ (\i -> M.unsafeWrite v i 0) [0..255]
        Prelude.mapM_ (\i -> M.unsafeRead v (fromIntegral i) >>= M.unsafeWrite v (fromIntegral i) . (+1) ) (B.unpack bs)
        return v

And the results (shortened) reflect amazingly well on map group sort but leave the unboxed mutable vector/array style solution unsurprisingly in the lead:

benchmarking map count
mean: 308.7067 us, lb 307.3562 us, ub 310.5099 us, ci 0.950
std dev: 7.942305 us, lb 6.269114 us, ub 10.08334 us, ci 0.950

benchmarking map group sort
mean: 43.03601 us, lb 42.93492 us, ub 43.15815 us, ci 0.950
std dev: 567.5979 ns, lb 486.8838 ns, ub 666.0098 ns, ci 0.950

benchmarking fold counters
mean: 191.5338 us, lb 191.1102 us, ub 192.0366 us, ci 0.950
std dev: 2.370183 us, lb 1.995243 us, ub 2.907595 us, ci 0.950

benchmarking vCount
mean: 12.98727 us, lb 12.96037 us, ub 13.02261 us, ci 0.950
std dev: 156.6505 ns, lb 123.6789 ns, ub 198.4892 ns, ci 0.950

Oddly, when I increase the bytestring size to 200K, as Daniel used, then map/group/sort clocks in at ~250us while the vector solution takes over 500us:

benchmarking map count
mean: 5.796340 ms, lb 5.788830 ms, ub 5.805126 ms, ci 0.950
std dev: 41.65349 us, lb 35.69293 us, ub 48.39205 us, ci 0.950

benchmarking map group sort
mean: 260.7405 us, lb 259.2525 us, ub 262.4742 us, ci 0.950
std dev: 8.247289 us, lb 7.127576 us, ub 9.371299 us, ci 0.950

benchmarking fold counters
mean: 3.915101 ms, lb 3.892415 ms, ub 4.006287 ms, ci 0.950
std dev: 201.7632 us, lb 43.13063 us, ub 469.8170 us, ci 0.950

benchmarking vCount
mean: 556.5588 us, lb 545.4895 us, ub 567.9318 us, ci 0.950
std dev: 57.44888 us, lb 51.22270 us, ub 65.91105 us, ci 0.950
found 1 outliers among 100 samples (1.0%)
variance introduced by outliers: 80.038%
variance is severely inflated by outliers

But that variance is tremendous - perhaps some playing with heap sizes would make that go away (in the benchmark program at least), but not quickly or easily for me.

Thomas M. DuBuisson

(Do not take this very seriously.)

The (actually) fastest solution, and a pure FP one, is this... nearly:

data Hist = Hist {v00 :: Int, v01 :: Int {- , v02 :: Int, ... -} }

emptyHist :: Hist
emptyHist = Hist 0 0 {- 0 0 ... -}

foldRecord :: B.ByteString -> [Int]
foldRecord = histToList . B.foldl' cnt emptyHist
  where histToList (Hist x00 x01 {- x02 ... -}) = [x00, x01 {- , x02, ... -}]
        cnt (Hist !x00 !x01)    0x00      = Hist (x00 + 1) x01 {- x02 ... -}
        cnt (Hist !x00 !x01) {- 0x01 -} _ = Hist x00 (x01 + 1) {- x02 ... -}
        {- ... -}
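
To see the idea in full without the 256-way elision, here is a runnable miniature of the same pattern restricted to a 4-value alphabet (my own reduction; strict constructor fields stand in for the bang patterns):

```haskell
import qualified Data.ByteString as B

-- Four strict counters; byte values 3..255 all land in the last bucket.
data Hist4 = Hist4 !Int !Int !Int !Int

foldRecord4 :: B.ByteString -> [Int]
foldRecord4 = histToList . B.foldl' cnt (Hist4 0 0 0 0)
  where
    histToList (Hist4 a b c d) = [a, b, c, d]
    cnt (Hist4 a b c d) 0 = Hist4 (a + 1) b c d
    cnt (Hist4 a b c d) 1 = Hist4 a (b + 1) c d
    cnt (Hist4 a b c d) 2 = Hist4 a b (c + 1) d
    cnt (Hist4 a b c d) _ = Hist4 a b c (d + 1)
```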

Using @Thomas's benchmarks, this runs in 11.67 us (the previously fastest, vCount, takes 14.99 us on my machine).

The problem appears when cnt is split into the 256 possible patterns (fully equivalent code using lens is here).

The compiler is slow at selecting the proper pattern (the left side of cnt) and at the increments (the right side of cnt), but I think it should generate efficient code (at least as efficient as the two-pattern version).

(Using 256 cnt patterns and 256 Hist fields takes 1.35 ms!!!)

(On my machine, map group sort takes 42 us, behind the vCount alternative.)

josejuan