
Benchmarking shows that the cereal library takes 100x longer to deserialize a data structure of mine (detailed below) than it takes to read the same data off the drive:

benchmarking Read
mean: 465.7050 us, lb 460.9873 us, ub 471.0938 us, ci 0.950
std dev: 25.79706 us, lb 22.19820 us, ub 30.81870 us, ci 0.950
found 4 outliers among 100 samples (4.0%)
  4 (4.0%) high mild
variance introduced by outliers: 53.460%
variance is severely inflated by outliers

benchmarking Read + Decode
collecting 100 samples, 1 iterations each, in estimated 6.356502 s
mean: 68.85135 ms, lb 67.65992 ms, ub 70.05832 ms, ci 0.950
std dev: 6.134430 ms, lb 5.607914 ms, ub 6.755639 ms, ci 0.950
variance introduced by outliers: 74.863%
variance is severely inflated by outliers
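
For concreteness, the "Read + Decode" benchmark boils down to something like the following sketch (loadSubIndex is an illustrative name, not code from my program; "Read" measures only the B.readFile part):

import qualified Data.ByteString as B
import Data.IntMap (IntMap)
import Data.Serialize (decode)

-- Strictly read a serialized sub-index off disk, then decode it with
-- cereal.  Triplet and Atom are the types defined below.
loadSubIndex :: FilePath -> IO (Either String (IntMap [Triplet Atom]))
loadSubIndex path = decode <$> B.readFile path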

This is also borne out by profiling a program of mine that makes typical deserialization use of this data structure, where 98% of the time is spent deserializing the data and only 1% goes to IO plus the core algorithm:

COST CENTRE                    MODULE               %time %alloc

getWord8                       Data.Serialize.Get    30.5   40.4
unGet                          Data.Serialize.Get    29.5   17.9
getWord64be                    Data.Serialize.Get    14.0   10.7
getListOf                      Data.Serialize.Get    10.2   12.8
roll                           Data.Serialize         8.2   11.5
shiftl_w64                     Data.Serialize.Get     3.4    2.9
decode                         Data.Serialize         2.9    3.1
main                           Main                   1.3    0.6

The data structure I'm deserializing is an IntMap [Triplet Atom] and the definitions of the component types are given below:

type Triplet a = (a, a, a)

data Point = Point {
    _x :: {-# UNPACK #-} !Double ,
    _y :: {-# UNPACK #-} !Double ,
    _z :: {-# UNPACK #-} !Double }

data Atom = Atom {
    _serial :: {-# UNPACK #-} !Int    ,
    _r      :: {-# UNPACK #-} !Point  ,
    _n      :: {-# UNPACK #-} !Word64 }

I'm using the default IntMap, (,,), and [] instances provided by cereal, and the following instances for my custom types:

instance Serialize Point where
    put (Point x y z) = do
        put x
        put y
        put z
    get = Point <$> get <*> get <*> get

instance Serialize Atom where
    put (Atom s r n) = do
        put s
        put r
        put n
    get = Atom <$> get <*> get <*> get

So my questions are:

  1. Why is deserialization so slow in general?
  2. Is there any way to change my data structure (i.e. IntMap/[]) to make the deserialization go faster?
  3. Is there any way to change my data types (i.e. Atom/Point) to make deserialization go faster?
  4. Are there faster alternatives to cereal within Haskell, or should I store the data structure in C-land for more rapid deserialization (e.g. using mmap)?

The files I'm deserializing are sub-indices for a search engine: the full index cannot fit in memory on the target machine (a consumer-grade desktop), so I store each sub-index on disk and read+decode the sub-indices pointed to by the global index, which resides in memory. Also, I'm not concerned about serialization speed: searching the index is the bottleneck for the end user, and cereal's current serialization performance is satisfactory for generating and updating the index.

Edit:

I tried out Don's suggestion of using a space-efficient triplet, and it quadrupled the decoding speed:

benchmarking Read
mean: 468.9671 us, lb 464.2564 us, ub 473.8867 us, ci 0.950
std dev: 24.67863 us, lb 21.71392 us, ub 28.39479 us, ci 0.950
found 2 outliers among 100 samples (2.0%)
  2 (2.0%) high mild
variance introduced by outliers: 50.474%
variance is severely inflated by outliers

benchmarking Read + Decode
mean: 15.04670 ms, lb 14.99097 ms, ub 15.10520 ms, ci 0.950
std dev: 292.7815 us, lb 278.8742 us, ub 308.1960 us, ci 0.950
variance introduced by outliers: 12.303%
variance is moderately inflated by outliers

However, decoding still remains the bottleneck, taking roughly 30x as long as the IO. Also, can anybody explain why Don's suggestion works? Does this mean that switching to something other than a list (an array, perhaps) might give a further improvement, too?
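
For reference, the space-efficient triplet from Don's comment looks roughly like this. AtomTriplet is a placeholder name, and the Serialize instance is my guess at the natural one (if I read cereal's tuple instance correctly, it writes the same bytes as the (,,) version, so the on-disk format is unchanged):

-- Strict, monomorphic triplet (Don's suggestion); adding {-# UNPACK #-}
-- to the fields would flatten the Atoms in place as well.
data AtomTriplet = AtomTriplet !Atom !Atom !Atom

instance Serialize AtomTriplet where
    put (AtomTriplet a b c) = put a >> put b >> put c
    get = AtomTriplet <$> get <*> get <*> get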

Edit #2: I just switched to the latest Haskell Platform and reran profiling for cereal. The information is considerably more detailed, and I've provided an hpaste of it: http://hpaste.org/69575

Gabriella Gonzalez
  • You might have a small win replacing the `Triplet` with a more space efficient type, e.g. `Triplet !Atom !Atom !Atom`. And this is all with `-O2` (as optimizing instances is highly performance sensitive)? Finally, what results do you get with `Data.Binary` -- it can be faster, as it produces lazy output. `IntMap` may have an inefficient implementation as well -- alternative serializations may be faster. – Don Stewart Jun 05 '12 at 18:08
  • Yes, this is all with `-O2`. I don't think the laziness will give a significant advantage because a lot of these files (not the one I benchmarked above, but other ones) are within the chunk size of `ByteString` and they still give the same ratio of `decode` to `IO` speed. I will go ahead and test `Binary`, though, and see what I get. – Gabriella Gonzalez Jun 05 '12 at 18:11
  • The main reason why Data.Binary can be faster is that it can fill smaller bytestring chunks, with fewer reallocations. That said, I'd look closely at the instances for IntMap in cereal. – Don Stewart Jun 05 '12 at 19:00
  • I tried out both of your suggestions. Switching to `Binary` did not help, but using the space-efficient triplet did help. Also, I checked the `IntMap` instance for cereal and it's going through the list representation. – Gabriella Gonzalez Jun 05 '12 at 19:03
  • I've not tried it, but you may be able to traverse the IntMap, deserializing directly and avoiding the intermediate list. Additionally, check that the instance methods are being inlined. – Don Stewart Jun 05 '12 at 19:09
  • You might want to look at how both libraries serialize Doubles, I suspect there could be quite a bit of variation there. – stephen tetley Jun 05 '12 at 20:31
  • Oh yes, that's also a good point. There's a special double instances - http://stackoverflow.com/questions/6976684/converting-ieee-754-floating-point-in-haskell-word32-64-to-and-from-haskell-floa . Finally, it looks like some of the instances in `cereal` aren't inlining properly. – Don Stewart Jun 05 '12 at 21:08
  • @DonStewart I will see if I can skip the lists (both for `IntMap` and my own use of list), as that seems to be the bulk of the problem, judging by the profiling data. The `getWord8`s might be being used for something with multiple constructors, which might be the list, and even `getListOf` is consuming a significant fraction of time. However, I was going to install the latest Haskell Platform today (mine's old), so let me do that first. I will then take `cereal` and compile from source and experiment with inlines. – Gabriella Gonzalez Jun 05 '12 at 21:16
  • You might want to increase the benchmark size 10x or 100x, because 15 ms is a very short (and jittery) time span for a computer. This will improve accuracy. Anyway, your point stands. – usr Jun 05 '12 at 21:28
  • Ok, I just switched to latest Haskell platform, which gives [WAY better profiling information](http://hpaste.org/69575). – Gabriella Gonzalez Jun 05 '12 at 23:11
  • A new version of cereal was released on the same day you asked this question, with performance improvements listed: http://hackage.haskell.org/package/cereal. Though I'm very surprised there's no inlining on the Monad definition (so we see >>= show up in profiles). – Don Stewart Jun 06 '12 at 21:03
  • @DonStewart Yeah, I noticed that, too. The latest profile does show that the monad instance is the rate limiting step. Right now, though, I'm trying out refactoring my data structure to use `StorableArray` so that I can just `memcpy` the data structure directly from disk using `unsafeForeignPtrToStorableArray`. Also, I was a moron and forgot that my integer indices for the IntMap are contiguous, so I could use an array all along. My initial benchmarks show that this makes IO rate limiting now, and I will update after I finish implementing it. – Gabriella Gonzalez Jun 06 '12 at 21:29
  • You might look at Storable instances for Vector. They're generally more useful than arrays; and have mmap support. – Don Stewart Jun 06 '12 at 21:47
  • @DonStewart I used `Data.Vector.Storable` and it worked beautifully (> 10x speedup and I can still do more). I will do a write-up of the solution with final benchmarks after I'm done cleaning up the complete mess I made. – Gabriella Gonzalez Jun 07 '12 at 00:44

1 Answer


OK, to answer this by summarizing the advice. For fast deserialization of data:

  • Use cereal (strict bytestring output) or binary (lazy bytestring output)
  • Make sure you're compiling with -O2, as these libraries rely on inlining to remove overhead
  • Use dense data types, such as replacing a polymorphic tuple with an unpacked, specialized form.
  • Avoid converting data types to lists to serialize them. If you have bytestrings, this is taken care of. For unpacked array types, you usually get very fast IO, but it's worth double-checking the instances (see the first sketch after this list).
  • You may be able to use mmap'd IO (see the second sketch after this list).
  • For double-heavy data, consider a more efficient Double reader.
  • Use modern array and container types tuned for performance, with more recent GHC versions.
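
Two hedged sketches of the list-avoidance and mmap bullets above. First, deserializing the IntMap directly instead of via the intermediate list its stock instance builds. This assumes cereal's list wire format (a big-endian 64-bit length prefix followed by the elements, with keys and values encoded by their default instances), so it should read data written by the stock instance:

import qualified Data.IntMap as IM
import Data.Serialize (Get, Serialize, get, getWord64be)

-- Read cereal's length prefix, then insert each key/value pair into
-- the map as it is decoded, never materializing an [(Int, a)] list.
getIntMapDirect :: Serialize a => Get (IM.IntMap a)
getIntMapDirect = go IM.empty =<< getWord64be
  where
    go m 0 = return m
    go m k = do
        key <- get
        val <- get
        let m' = IM.insert key val m
        m' `seq` go m' (k - 1)

Second, the mmap route via the mmap package and Data.Vector.Storable. Here mmapDoubles and the raw-Double file layout are illustrative assumptions (a real version would map whatever flattened representation the index uses), and this only works for data written in the machine's native byte order:

import qualified Data.Vector.Storable as V
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr)
import System.IO.MMap (Mode (ReadOnly), mmapFileForeignPtr)

-- Map the whole file into memory and reinterpret its bytes as a
-- vector of Doubles; no per-element decoding happens at all.
mmapDoubles :: FilePath -> IO (V.Vector Double)
mmapDoubles path = do
    (fptr, off, len) <- mmapFileForeignPtr path ReadOnly Nothing
                            :: IO (ForeignPtr Word8, Int, Int)
    return (V.unsafeCast (V.unsafeFromForeignPtr fptr off len))

Per the comment thread, the storable-vector route is what ultimately produced the >10x speedup.
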
Don Stewart
  • Thanks so much for your help. One last thing: could you please fix `vector-binary-instances` so that it doesn't trigger overlapping instances? There is another place in my project where I decided to use a `Vector` of non-`Storable` objects, and your `vector-binary-instances` was unusable, even with extensions. I had to copy and paste the source and specialize it to a `Vector` type to get it to work. – Gabriella Gonzalez Jun 07 '12 at 20:02