Packed large bit vector with efficient xor and bit count in Haskell

Question

I am looking for an efficient (in both space and time) data type which can hold a 384 bit vector and supports efficient XOR and "bit count" (number of bits set to 1) operations.

Below, please find my demo program. The operations I need are all in the SOQuestionOps type class and I have implemented it for Natural and Data.Vector.Unboxed.Bit. Especially the latter seems perfect as it has a zipWords operation which should allow me to do operations like "bit count" and XOR word-by-word instead of bit-by-bit. Also it claims to store the bits packed (8 bits per byte).

{-# LANGUAGE FlexibleInstances #-}
import Data.Bits
import Data.List (foldl')
import Numeric.Natural
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed.Bit as BV

class SOQuestionOps a where
    soqoXOR :: a -> a -> a
    soqoBitCount :: a -> Int
    soqoFromList :: [Bool] -> a

alternating :: Int -> [Bool]
alternating n =
    let c = n `mod` 2 == 0
     in if n == 0
           then []
           else c : alternating (n-1)

instance SOQuestionOps Natural where
    soqoXOR = xor
    soqoBitCount = popCount
    soqoFromList v =
        let oneIdxs = map snd $ filter fst (zip v [0..])
         in foldl' (\acc n -> acc `setBit` n) 0 oneIdxs

instance SOQuestionOps (BV.Vector BV.Bit) where
    soqoXOR = BV.zipWords xor
    soqoBitCount = BV.countBits
    soqoFromList v = BV.fromList (map BV.fromBool v)

main =
    let initialVec :: BV.Vector BV.Bit
        initialVec = soqoFromList $ alternating 384
        lotsOfVecs = V.replicate 10000000 (soqoFromList $ take 384 $ repeat True)
        xorFolded = V.foldl' soqoXOR initialVec lotsOfVecs
        sumBitCounts = V.foldl' (\n v -> n + soqoBitCount v) 0 lotsOfVecs
     in putStrLn $ "folded bit count: " ++ show (soqoBitCount xorFolded) ++ ", sum: " ++ show sumBitCounts

So let's calculate numbers for the best case: lotsOfVecs shouldn't need to allocate much because it's just 10,000,000 times the same vector initialVec. The foldl obviously creates one of these vectors per fold operation, so it should create 10,000,000 bit vectors. The bit counting should create anything but 10,000,000 Ints. So in the best case, my program should use very little (and constant) memory and the total allocations should roughly be 10,000,000 * sizeof(bit vector) + 10,000,000 * sizeof(int) = 520,000,000 bytes .

Ok, let's run the program for Natural:

let's make initialVec :: Natural, compile with

ghc --make -rtsopts -O3 MemStuff.hs

result (this is with GHC 7.10.1):

$ ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 3840000000
1,280,306,112 bytes allocated in the heap
201,720 bytes copied during GC
80,106,856 bytes maximum residency (2 sample(s))
662,168 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0      2321 colls,     0 par    0.056s   0.059s     0.0000s    0.0530s
Gen  1         2 colls,     0 par    0.065s   0.069s     0.0346s    0.0674s

INIT    time    0.000s  (  0.000s elapsed)
MUT     time    0.579s  (  0.608s elapsed)
GC      time    0.122s  (  0.128s elapsed)
EXIT    time    0.000s  (  0.002s elapsed)
Total   time    0.702s  (  0.738s elapsed)

%GC     time      17.3%  (17.3% elapsed)

Alloc rate    2,209,576,763 bytes per MUT second

Productivity  82.7% of total user, 78.7% of total elapsed


real    0m0.754s
user    0m0.704s
sys 0m0.037s

which has 1,280,306,112 bytes allocated in the heap, that's in the ballpark (2x) of the expected figure. Btw on GHC 7.8 this allocates 353,480,272,096 bytes and runs for absolute ages as popCount isn't very efficient on GHC 7.8's Naturals.

EDIT: I changed the code a bit. In the original version, every other vector was 0 in the fold. Which gave a lot better allocation figures for the Natural version. I changed it so the vector alternates between to different representations (with many bits set) and now we see 2x allocations of the expected. That's another downside of Natural (and Integer): The allocation rate depends on the values.

But maybe we can do better, let's try the densely packed Data.Vector.Unboxed.Bit:

That's initialVec :: BV.Vector BV.Bit and re-compile and re-run with the same options.

$ time ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 1920000000
75,120,306,536 bytes allocated in the heap
54,914,640 bytes copied during GC
80,107,368 bytes maximum residency (2 sample(s))
664,128 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0     145985 colls,     0 par    0.543s   0.627s     0.0000s    0.0577s
Gen  1         2 colls,     0 par    0.065s   0.070s     0.0351s    0.0686s

INIT    time    0.000s  (  0.000s elapsed)
MUT     time   27.679s  ( 28.228s elapsed)
GC      time    0.608s  (  0.698s elapsed)
EXIT    time    0.000s  (  0.002s elapsed)
Total   time   28.288s  ( 28.928s elapsed)

%GC     time       2.1%  (2.4% elapsed)

Alloc rate    2,714,015,097 bytes per MUT second

Productivity  97.8% of total user, 95.7% of total elapsed


real    0m28.944s
user    0m28.290s
sys 0m0.456s

That's very slow and roughly 100 times the allocations :(.

Ok, then lets recompile and profile both runs (ghc --make -rtsopts -O3 -prof -auto-all -caf-all -fforce-recomp MemStuff.hs):

The Natural version:

COST CENTRE         MODULE  %time %alloc
main.xorFolded      Main     51.7   76.0
main.sumBitCounts.\ Main     25.4   16.0
main.sumBitCounts   Main     12.1    0.0
main.lotsOfVecs     Main     10.4    8.0

The Data.Vector.Unboxed.Bit version:

COST CENTRE         MODULE  %time %alloc
soqoXOR             Main     96.7   99.3
main.sumBitCounts.\ Main      1.9    0.2

Is Natural really the best option for a fixed size bit vector? And what about GHC 6.8? And is there anything better which can implement my SOQuestionOps type class?

I would make a special purpose data type with 6 unpacked `Word64`, and then use the primitive operations word-wise on those. — augustss, Jul 23 '15 at 11:40
Hi @augustss, thanks! I thought about that and didn't do it in the end since I have lots of unit & QuickCheck tests which quickly solve sub-problems where the vectors are considerably shorter. Obviously, I could still always use the big vector type and only make use of the first `n` bits in the tests but I thought there might be a nice unpacked representation which works regardless of the number of bits. Also: For my program to generally solve the problem, the length of the bit vector is not restricted to 384 :(. So my program would only work if your input data is "small enough". — Johannes Weiss, Jul 23 '15 at 11:44
`Natural` and `Integer` are pretty good for this, since they use the GMP primitives which are optimized to the hilt. — András Kovács, Jul 23 '15 at 11:52
In the bitvector version, most of the time is spent in `soqoXOR`, which is implemented using `Data.Vector.Unboxed.Bit.zipWords`, which [converts back and forth between mutable and immutable vectors](http://hackage.haskell.org/package/bitvec-0.1.0.1/docs/src/Data-Vector-Unboxed-Bit.html#zipWords). I wonder if that explains the problem, and if using the mutable version directly in `ST` would help. — duplode, Jul 23 '15 at 11:53
@duplode thanks! In my real use case I need every produced vector though. So the code doesn't look too bad to me: `xs <- V.thaw xs` makes one copy, `B.zipInPlace op xs ys` looks like it zips in place and `Unsafe.unsafeFreeze xs` makes it immutable in constant time. So I'd expect it to make exactly one copy per iteration which is what I want. Do you agree? — Johannes Weiss, Jul 23 '15 at 12:25
@AndrásKovács ok, only on GHC 7.10 though, 7.8 is hideously slow though :). And still on 7.10 it allocates twice as much as needed :(. — Johannes Weiss, Jul 23 '15 at 12:27
About thaw and freeze, that sounds reasonable indeed. As for the 2x allocation, it is not unexpected given the overhead of the `data` constructor of the `Natural`s (cf. [this question](http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-types)). — duplode, Jul 23 '15 at 12:42
@duplode true, so that should add `10,000,000 * sizeof(Word)` which is `8` for me so add `80,000,000`. The expected figure is then `600,000,000`, however `1,280,306,112` is still a tiny little bit more than twice the figure, right? — Johannes Weiss, Jul 23 '15 at 12:58
According to that answer, it would be two words for a (small) `Integer`, plus two words for a (single field) constructor, so 4 words per `Natural`. If I'm not messing up the calculations you show in the question, that would give `(10kk * 384 / 8) + (10kk * 4 * 8) = 800,000,000`, which is halfway through. — duplode, Jul 23 '15 at 13:20
@duplode thanks! You're absolutely right. So we're "only" missing `400,000,000` bytes, which is roughly `40` per created `Natural` (whose actual data only has `48` bytes). — Johannes Weiss, Jul 23 '15 at 14:03

score 1 · Answer 1 · answered Jul 23 '15 at 12:20

1

Have a look at the Data.LargeWord module in the Crypto package:

http://hackage.haskell.org/package/Crypto-4.2.5.1/docs/Data-LargeWord.html

It provides Bits instances for large words of various sizes, e.g. 96 through 256 bits.

answered Jul 23 '15 at 12:20

ErikR

51,541
9
73
124

I wonder why `Crypto` doesn't just depend on [`largeword`](http://hackage.haskell.org/package/largeword), a package with just the `LargeWord` module and written by the same author. See also the [`Sum`](http://hackage.haskell.org/package/set-cover-0.0.4/docs/Math-SetCover-Bit.html#t:Sum) type from `set-cover`, which is similar but uses strict fields. Probably if efficiency is a thing one would want to unpack them as well. – Daniel Wagner Jul 23 '15 at 23:06

Packed large bit vector with efficient xor and bit count in Haskell

1 Answers1