I am looking for an efficient (in both space and time) data type which can hold a 384 bit vector and supports efficient XOR and "bit count" (number of bits set to 1) operations.
Below, please find my demo program. The operations I need are all in the SOQuestionOps type class and I have implemented it for Natural and Data.Vector.Unboxed.Bit. Especially the latter seems perfect as it has a zipWords operation which should allow me to do operations like "bit count" and XOR word-by-word instead of bit-by-bit. Also it claims to store the bits packed (8 bits per byte).
{-# LANGUAGE FlexibleInstances #-}
import Data.Bits
import Data.List (foldl')
import Numeric.Natural
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed.Bit as BV

class SOQuestionOps a where
    soqoXOR :: a -> a -> a
    soqoBitCount :: a -> Int
    soqoFromList :: [Bool] -> a

alternating :: Int -> [Bool]
alternating n =
    let c = n `mod` 2 == 0
    in if n == 0
          then []
          else c : alternating (n-1)

instance SOQuestionOps Natural where
    soqoXOR = xor
    soqoBitCount = popCount
    soqoFromList v =
        let oneIdxs = map snd $ filter fst (zip v [0..])
        in foldl' (\acc n -> acc `setBit` n) 0 oneIdxs

instance SOQuestionOps (BV.Vector BV.Bit) where
    soqoXOR = BV.zipWords xor
    soqoBitCount = BV.countBits
    soqoFromList v = BV.fromList (map BV.fromBool v)

main =
    let initialVec :: BV.Vector BV.Bit
        initialVec = soqoFromList $ alternating 384
        lotsOfVecs = V.replicate 10000000 (soqoFromList $ take 384 $ repeat True)
        xorFolded = V.foldl' soqoXOR initialVec lotsOfVecs
        sumBitCounts = V.foldl' (\n v -> n + soqoBitCount v) 0 lotsOfVecs
    in putStrLn $ "folded bit count: " ++ show (soqoBitCount xorFolded) ++ ", sum: " ++ show sumBitCounts
So let's calculate numbers for the best case: lotsOfVecs shouldn't need to allocate much because it's just the same vector repeated 10,000,000 times. The foldl' obviously creates one new vector per fold step, so it should create 10,000,000 bit vectors. The bit counting should create nothing but 10,000,000 Ints. So in the best case, my program should use very little (and constant) memory, and the total allocations should be roughly 10,000,000 * sizeof(bit vector) + 10,000,000 * sizeof(int) = 520,000,000 bytes (taking 48 bytes per 384-bit vector and 4 bytes per int).
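The estimate above can be reproduced as a quick calculation. This is only a sketch: the names vecBytes and estimate are made up for illustration, only payload bytes are counted (no heap object headers), and the 4-byte int is the assumption implied by the 520,000,000 figure. With an 8-byte Int, as on 64-bit GHC, the estimate would be somewhat higher.

```haskell
-- Back-of-the-envelope allocation estimate (payload bytes only, no headers).
-- All names here are illustrative and not part of the demo program.

-- 384 bits packed at 8 bits per byte.
vecBytes :: Integer
vecBytes = 384 `div` 8

-- Total bytes for 10,000,000 vectors plus 10,000,000 ints of the given size.
estimate :: Integer -> Integer
estimate intBytes = 10000000 * vecBytes + 10000000 * intBytes

main :: IO ()
main = do
  print (estimate 4)  -- 520000000, assuming a 4-byte int
  print (estimate 8)  -- 560000000, assuming an 8-byte Int
```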
OK, let's run the program for Natural: change the annotation in main to initialVec :: Natural and compile with
ghc --make -rtsopts -O3 MemStuff.hs
Result (this is with GHC 7.10.1):
$ ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 3840000000
1,280,306,112 bytes allocated in the heap
201,720 bytes copied during GC
80,106,856 bytes maximum residency (2 sample(s))
662,168 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
                           Tot time (elapsed)  Avg pause  Max pause
Gen 0    2321 colls, 0 par   0.056s   0.059s    0.0000s    0.0530s
Gen 1       2 colls, 0 par   0.065s   0.069s    0.0346s    0.0674s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.579s ( 0.608s elapsed)
GC time 0.122s ( 0.128s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 0.702s ( 0.738s elapsed)
%GC time 17.3% (17.3% elapsed)
Alloc rate 2,209,576,763 bytes per MUT second
Productivity 82.7% of total user, 78.7% of total elapsed
real 0m0.754s
user 0m0.704s
sys 0m0.037s
That is 1,280,306,112 bytes allocated in the heap, which is in the ballpark (roughly 2x) of the expected figure. Btw, on GHC 7.8 this allocates 353,480,272,096 bytes and runs for absolute ages, as popCount isn't very efficient on GHC 7.8's Naturals.
EDIT: I changed the code a bit. In the original version, every other vector in the fold was 0, which gave much better allocation figures for the Natural version. I changed it so the vector alternates between two different representations (with many bits set), and now we see 2x the expected allocations. That's another downside of Natural (and Integer): the allocation rate depends on the values.
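To illustrate why the allocation depends on the value: an arbitrary-precision number only needs as many machine words as its highest set bit requires. The sketch below counts those words for a few values. limbsNeeded is a made-up helper, 64-bit words are assumed, and it says nothing about the exact heap layout GHC uses for Natural.

```haskell
import Data.Bits (setBit, shiftR)
import Data.List (foldl')
import Numeric.Natural (Natural)

-- Number of 64-bit words needed to hold a value (hypothetical helper).
limbsNeeded :: Natural -> Int
limbsNeeded n = (bitLength n + 63) `div` 64
  where
    -- Position of the highest set bit, plus one; 0 for the value 0.
    bitLength 0 = 0
    bitLength k = 1 + bitLength (k `shiftR` 1)

-- A value with all 384 low bits set.
allOnes :: Natural
allOnes = foldl' setBit 0 [0 .. 383]

main :: IO ()
main = do
  print (limbsNeeded 0)        -- 0
  print (limbsNeeded 1)        -- 1
  print (limbsNeeded allOnes)  -- 6
```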
But maybe we can do better. Let's try the densely packed Data.Vector.Unboxed.Bit: change the annotation to initialVec :: BV.Vector BV.Bit, then re-compile and re-run with the same options.
$ time ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 1920000000
75,120,306,536 bytes allocated in the heap
54,914,640 bytes copied during GC
80,107,368 bytes maximum residency (2 sample(s))
664,128 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
                           Tot time (elapsed)  Avg pause  Max pause
Gen 0  145985 colls, 0 par   0.543s   0.627s    0.0000s    0.0577s
Gen 1       2 colls, 0 par   0.065s   0.070s    0.0351s    0.0686s
INIT time 0.000s ( 0.000s elapsed)
MUT time 27.679s ( 28.228s elapsed)
GC time 0.608s ( 0.698s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 28.288s ( 28.928s elapsed)
%GC time 2.1% (2.4% elapsed)
Alloc rate 2,714,015,097 bytes per MUT second
Productivity 97.8% of total user, 95.7% of total elapsed
real 0m28.944s
user 0m28.290s
sys 0m0.456s
That's very slow (about 40 times the run time) and allocates roughly 60 times as much as the Natural version :(.
OK, then let's recompile and profile both runs (ghc --make -rtsopts -O3 -prof -auto-all -caf-all -fforce-recomp MemStuff.hs):
The Natural version:
COST CENTRE          MODULE  %time  %alloc
main.xorFolded       Main     51.7    76.0
main.sumBitCounts.\  Main     25.4    16.0
main.sumBitCounts    Main     12.1     0.0
main.lotsOfVecs      Main     10.4     8.0
The Data.Vector.Unboxed.Bit version:
COST CENTRE          MODULE  %time  %alloc
soqoXOR              Main     96.7    99.3
main.sumBitCounts.\  Main      1.9     0.2
Is Natural really the best option for a fixed-size bit vector? And what about GHC 7.8? And is there anything better which can implement my SOQuestionOps type class?
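For comparison, here is a minimal sketch of one candidate that might be worth benchmarking: a hand-rolled record of six strict, unpacked 64-bit words, with XOR and bit count done word by word. Vec384, xor384, bitCount384 and fromList384 are hypothetical names, not from any library, and no measurements were taken for it, so whether it actually beats Natural is an open question.

```haskell
import Data.Bits (popCount, setBit, xor)
import Data.List (foldl')
import Data.Word (Word64)

-- Hypothetical fixed-size 384-bit vector: six strict, unpacked 64-bit words.
data Vec384 = Vec384
  {-# UNPACK #-} !Word64 {-# UNPACK #-} !Word64 {-# UNPACK #-} !Word64
  {-# UNPACK #-} !Word64 {-# UNPACK #-} !Word64 {-# UNPACK #-} !Word64
  deriving (Eq, Show)

-- XOR word by word.
xor384 :: Vec384 -> Vec384 -> Vec384
xor384 (Vec384 a0 a1 a2 a3 a4 a5) (Vec384 b0 b1 b2 b3 b4 b5) =
  Vec384 (a0 `xor` b0) (a1 `xor` b1) (a2 `xor` b2)
         (a3 `xor` b3) (a4 `xor` b4) (a5 `xor` b5)

-- Bit count as the sum of the per-word population counts.
bitCount384 :: Vec384 -> Int
bitCount384 (Vec384 a b c d e f) =
  popCount a + popCount b + popCount c + popCount d + popCount e + popCount f

-- Build from a list of Bools; index 0 is the lowest bit of the first word.
-- Entries beyond the first 384 are ignored; missing entries count as False.
fromList384 :: [Bool] -> Vec384
fromList384 bs = Vec384 w0 w1 w2 w3 w4 w5
  where
    [w0, w1, w2, w3, w4, w5] = map pack (chunks 6 (bs ++ repeat False))

    -- Split off k chunks of 64 bits each.
    chunks :: Int -> [Bool] -> [[Bool]]
    chunks 0 _  = []
    chunks k xs = let (h, t) = splitAt 64 xs in h : chunks (k - 1) t

    -- Pack up to 64 Bools into one word.
    pack :: [Bool] -> Word64
    pack = foldl' (\acc (i, b) -> if b then setBit acc i else acc) 0 . zip [0 ..]

main :: IO ()
main = do
  let ones = fromList384 (replicate 384 True)
      alt  = fromList384 (take 384 (cycle [True, False]))
  print (bitCount384 ones)               -- 384
  print (bitCount384 alt)                -- 192
  print (bitCount384 (xor384 ones alt))  -- 192
```

The three functions map directly onto soqoFromList, soqoXOR and soqoBitCount, so a type like this could be given an instance of the type class from the question.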