13

I have a 10GB Data.Vector.Unboxed vector that I want to efficiently save to disk. What's the best, most efficient way? I plan to read it from a memory-mapped file too.

I have seen this package, but it only works with Storable, and I need to stay with unboxed.

I was thinking of converting to a list, but I assume that is far from ideal.

jap
  • Why do you need to stay with Unboxed? [As I've pointed out before](http://stackoverflow.com/a/21897900/925978), I'm not aware of any difference between the two. – crockeea Apr 24 '14 at 01:19
  • I couldn't experiment with Storable because it doesn't support zip and my code relies on that. I could use zipWith, but that would involve major refactoring. – jap Apr 24 '14 at 08:20
  • Also, one of the big differences is that you have to add Storable instances everywhere. For example, if I want to zip, I would need to add instances for (Int, Int), and I would need to add a vast number of them. – jap Apr 24 '14 at 08:49
  • Of course Storable vectors support zipping: [link](http://hackage.haskell.org/package/vector-0.10.0.1/docs/Data-Vector-Storable.html#g:22). And just as you need `Storable` instances for new data types, you also need `Unbox` instances to use them with unboxed vectors. They're quite similar. If you're looking for a `Storable` tuple instance specifically, there's a package [here](http://hackage.haskell.org/package/storable-tuple). – crockeea Apr 24 '14 at 12:45

3 Answers

5

You can convert between Vector types at the cost of an O(n) traversal of the entire vector. The function you're looking for is `convert`. As long as you're not planning to write this vector out to disk often, this cost should not be significant overall, and it is certainly cheaper than actually writing the vector out to disk. However, if you find yourself paying this cost often, you should probably rethink the algorithm.
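As a minimal sketch of that conversion, assuming the `vector` package and an element type that has both `Unbox` and `Storable` instances (`Double` here), something like the following should work. The name `toStorable` is only an illustrative helper, not part of any library.

```haskell
import qualified Data.Vector.Generic  as G
import qualified Data.Vector.Storable as S
import qualified Data.Vector.Unboxed  as U

-- Hypothetical helper: build a storable vector from an unboxed one
-- in a single O(n) pass using the generic 'convert' function.
toStorable :: (U.Unbox a, S.Storable a) => U.Vector a -> S.Vector a
toStorable = G.convert

main :: IO ()
main = do
  let u = U.enumFromN (0 :: Double) 5   -- small stand-in for the real data
      s = toStorable u
  print (S.length s)
  print (S.toList s)
```

The storable result can then be handed to any library that expects `Data.Vector.Storable`, while the rest of the program keeps working on the unboxed vector.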

cassandracomar
  • I plan to write to disk once every day – jap Apr 23 '14 at 16:17
  • Then my answer should be sufficient. It's one extra O(n) pass over the elements in the vector (note that this only deals with the references to the data, it's not touching all 10GB) right before you write. – cassandracomar Apr 23 '14 at 18:19
4

I haven't tested it myself, but you could try vector-binary-instances, which provides `Binary` instances for vectors, and then serialize with the binary package, for example using `encodeFile`.
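A rough sketch of this approach, assuming the `binary` and `vector-binary-instances` packages are installed, could look like the code below. The file name `vector.bin` and the small test vector are placeholders chosen for illustration.

```haskell
import Data.Binary (decodeFile, encodeFile)
import Data.Vector.Binary ()            -- brings the Binary instances for vectors into scope
import qualified Data.Vector.Unboxed as U

main :: IO ()
main = do
  let v = U.enumFromN (0 :: Double) 1000   -- small stand-in for the real data
  -- Serialize the unboxed vector directly, with no conversion to Storable.
  encodeFile "vector.bin" v
  -- Read it back; the annotation tells decodeFile which type to decode.
  v' <- decodeFile "vector.bin" :: IO (U.Vector Double)
  print (v == v')
```

Note that the file layout is whatever the `Binary` instance produces, so it is probably not suitable for reading back through a memory-mapped file; it would need to go through `decodeFile` again.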

Petr
0

What about memory mapping the C array underlying the vector? Of course that works only if the Vector is unboxed :-).

Writing would then consist of taking the pointer to the array and the total size of the array in bytes, and writing that chunk of memory with a single C call.
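One way this idea could be sketched is shown below. It assumes the data is held in (or has been converted to) a `Data.Vector.Storable` vector, since that type exposes its underlying pointer through `unsafeWith`; it uses plain `hPutBuf`/`hGetBuf` rather than an actual memory map. The helper names `writeRaw` and `readRaw` and the file name `raw.bin` are made up for this example.

```haskell
import qualified Data.Vector.Storable as S
import Foreign.ForeignPtr (mallocForeignPtrArray, withForeignPtr)
import Foreign.Storable (sizeOf)
import System.IO

-- Hypothetical helper: dump the vector's memory to a file in one call.
writeRaw :: FilePath -> S.Vector Double -> IO ()
writeRaw path v =
  withBinaryFile path WriteMode $ \h ->
    S.unsafeWith v $ \ptr ->
      hPutBuf h ptr (S.length v * sizeOf (undefined :: Double))

-- Hypothetical helper: read the raw bytes back into a fresh vector.
readRaw :: FilePath -> IO (S.Vector Double)
readRaw path =
  withBinaryFile path ReadMode $ \h -> do
    bytes <- hFileSize h
    let n = fromIntegral bytes `div` sizeOf (undefined :: Double)
    fp <- mallocForeignPtrArray n
    _  <- withForeignPtr fp $ \ptr ->
            hGetBuf h ptr (n * sizeOf (undefined :: Double))
    return (S.unsafeFromForeignPtr0 fp n)

main :: IO ()
main = do
  let v = S.enumFromN (0 :: Double) 1000   -- small stand-in for the real data
  writeRaw "raw.bin" v
  v' <- readRaw "raw.bin"
  print (v == v')
```

Because the file holds the elements in their in-memory representation, it is only portable between machines with the same endianness and element size.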

Michal Gajda