This is an attempt to improve my earlier question, "Perl: seek to and read bits, not bytes", by explaining more thoroughly what I was trying to do.

I have x, a 9136 x 42 array of integers that I want to store super-efficiently in a file. The integers have the following constraints:

  • All of the 9136 integers in x[0..9135][0] are between -137438953472 and 137438953471, and can therefore be stored using 38 bits.

  • All of the 9136 integers in x[0..9135][1] are between -16777216 and 16777215, and can therefore be stored using 25 bits.

  • And so on... (the integer bit constraints are known in advance; Perl doesn't have to compute them)
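As a quick check of the widths quoted above, here is a minimal sketch that finds the smallest two's-complement width for a given range. The helper name `signed_bits` is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

# Smallest two's-complement width that holds every integer in [$min, $max].
# signed_bits is a hypothetical helper name used only for this illustration.
sub signed_bits {
    my ($min, $max) = @_;
    my $bits = 1;
    # An n-bit signed field covers -2**(n-1) .. 2**(n-1) - 1.
    $bits++ until $min >= -(2 ** ($bits - 1)) && $max <= 2 ** ($bits - 1) - 1;
    return $bits;
}

print signed_bits(-137438953472, 137438953471), "\n";   # range of x[..][0]
print signed_bits(-16777216, 16777215), "\n";           # range of x[..][1]
```

The two calls should report 38 and 25, matching the figures in the list.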

Question: Using Perl, how do I efficiently store this array in a file?

Notes:

  • If an integer can be stored in 25 bits, it can also be stored in 4 bytes (32 bits), if you're willing to waste 7 bits. In my situation, however, every bit counts.

  • I want to use file seek() to find data quickly, not read sequentially through the file.

  • The array will normally be accessed as x[i]. In other words, I'll want the 42 integers corresponding to a given x[i], so these 42 integers should be stored close to each other (ideally, they should be stored adjacent to each other in the file)

  • My initial approach was to just lay down a bitstream, and then find a way to read it back and change it back into an integer. My original question focused on that, but perhaps there's a better solution to the bigger problem that I'm not seeing.
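To make the notes above concrete, here is one possible sketch using only core Perl, under several assumptions that are mine: a 64-bit Perl, only the two widths quoted earlier (the real list would have 42), each signed value biased to an unsigned one before packing, and each row padded to a whole byte so that seek() can land on it. The names `pack_row`, `unpack_row`, `read_row` and the file name `rows.dat` are all hypothetical.

```perl
use strict;
use warnings;
no warnings 'portable';   # oct() on fields wider than 32 bits would otherwise warn

# Hypothetical layout: only the two widths quoted above; the real list has 42 entries.
my @widths = (38, 25);

my $row_bits = 0;
$row_bits += $_ for @widths;
my $row_bytes = int(($row_bits + 7) / 8);   # pad each row to a byte boundary

# Bias each signed value into 0 .. 2**w - 1 and append it as w binary digits.
sub pack_row {
    my @vals = @_;
    my $bits = '';
    for my $i (0 .. $#widths) {
        my $w = $widths[$i];
        $bits .= sprintf '%0*b', $w, $vals[$i] + (1 << ($w - 1));
    }
    return pack 'B*', $bits;   # zero-fills the final partial byte
}

# Reverse of pack_row: slice the bit string by width and remove the bias.
sub unpack_row {
    my ($bytes) = @_;
    my $bits = unpack "B$row_bits", $bytes;
    my $pos  = 0;
    my @vals;
    for my $w (@widths) {
        my $field = substr $bits, $pos, $w;
        $pos += $w;
        push @vals, oct("0b$field") - (1 << ($w - 1));
    }
    return @vals;
}

# Fetch row $i directly, without reading the rows before it.
sub read_row {
    my ($fh, $i) = @_;
    my $buf;
    seek $fh, $i * $row_bytes, 0 or die "seek failed: $!";
    read($fh, $buf, $row_bytes) == $row_bytes or die "short read";
    return unpack_row($buf);
}

my $file = 'rows.dat';   # hypothetical file name
my @rows = ([137438953471, -16777216], [-137438953472, 16777215], [-1, 0]);

open my $out, '>:raw', $file or die "Cannot write $file: $!";
print {$out} pack_row(@$_) for @rows;
close $out or die "Cannot close $file: $!";

open my $in, '<:raw', $file or die "Cannot read $file: $!";
my @got = read_row($in, 1);
close $in;
print "@got\n";
```

Padding costs at most 7 bits per row. If even that is too much, the rows could be abutted and row i located at bit offset i times the row width, at the price of reading one extra byte and discarding the leading bits.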

Far too much detail on what I'm doing:

  • Personal preference is to JSON::encode_json() the string and stick it in the file. That way, it's human readable, and in a format supported by multiple platforms, and you can parse it quickly. Also, it's pretty light-weight :) At my company, we also use the CPAN lib Storable (http://perldoc.perl.org/Storable.html), which also maintains your data structure. However, I'd go JSON :) – rurouni88 Aug 27 '14 at 03:14
  • Please would you explain the still bigger problem that has made you choose to pack bit fields into your file? Why is space efficiency so important? – Borodin Aug 27 '14 at 03:31
  • What are you trying to optimize, space, storing speed, retrieval speed? You have less than 3 MiB of data as int64, which is normally considered a drop in the ocean, yet you keep implying you want to save space. – ikegami Aug 27 '14 at 04:18
  • And if your storage is that constrained, what kind of contraints do you have on RAM? – ikegami Aug 27 '14 at 04:34
  • @Borodin (and ikegami) I think I've just become obsessed with the idea of storing this data in as little space as possible, and with the concept of treating files as strings of 1's and 0's (which is what we always tell people they are anyway). –  Aug 27 '14 at 20:40
  • @barrycarter: I've never told anyone that files are a string of 1s and 0s. They're an ordered, indexable sequence of octets, and you can make those octets mean anything you like. I think this issue is getting in the way of you writing a maintainable, working program. Unless you are running 64-bit Perl you have a problem with the 38-bit numbers anyway, so packing fields that aren't aligned on byte boundaries as well is going to complicate things even more. I suggest you don't go any further than allocating the minimum number of *bytes* per field instead of trying to abut a sequence of bit fields. – Borodin Aug 27 '14 at 20:45

1 Answer

I'm not sure I should be encouraging you, but it looks like Data::BitStream will do what you ask.

The program below writes a 38-bit value and a 25-bit value to a file, and then opens and retrieves the values intact.

#!/usr/bin/perl

use strict;
use warnings;

use Data::BitStream;

{
   my $bs_out = Data::BitStream->new(
      mode => 'w',
      file => 'bits.dat',
   );

   printf "Maximum %d bits per word\n", $bs_out->maxbits;

   $bs_out->write(38, 137438953471);
   $bs_out->write(25, 16777215);

   printf "Total %d bits written\n\n", $bs_out->len;
}

# Report the on-disk size of the file just written
printf "File size %d bytes\n", -s 'bits.dat';

{
   my $bs_in = Data::BitStream->new(
      mode => 'ro',
      file => 'bits.dat',
   );

   printf "Total %d bits read\n\n", $bs_in->len;
   print "Data:\n";

   print $bs_in->read(38), "\n";
   print $bs_in->read(25), "\n";
}

output

Maximum 64 bits per word
Total 63 bits written

File size 11 bytes
Total 63 bits read

Data:
137438953471
16777215

38 and 25 is 63 bits of data written, which the module confirms. But there is clearly some additional housekeeping data involved as the total size of the resulting file is eleven bytes, and not just the eight that would be the minimum necessary. Note that, when reopened, the data remembers that it is 63 bits long. However, it is shorter than the sixteen bytes that a file would have to be to contain two simple 64-bit integers.

What you do with this information is up to you, but remember that data packed in this way will be extremely difficult to debug with a hex editor. You may be shooting yourself in the foot if you adopt something like this.
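If you do go down this road, a small dump helper can take some of the pain out of debugging. The sketch below is only an illustration under my own assumptions: it works on a raw record with the fields abutted most-significant-bit first, not on the file Data::BitStream writes (which carries extra housekeeping bytes), and the name `dump_fields` is made up.

```perl
use strict;
use warnings;

# Assumed layout: the two widths from the example, fields abutted, MSB first.
my @widths = (38, 25);

# Render a packed record as one binary string per field, plus any padding bits.
# dump_fields is a hypothetical helper, not part of Data::BitStream.
sub dump_fields {
    my ($bytes) = @_;
    my $bits = unpack 'B*', $bytes;
    my @out;
    for my $w (@widths) {
        push @out, substr($bits, 0, $w, '');   # 4-arg substr removes the field from $bits
    }
    push @out, "pad:$bits" if length $bits;
    return join ' | ', @out;
}

# Build a record by hand from the two values used in the program above
# (needs a 64-bit Perl for the 38-bit value).
my $record = pack 'B*', sprintf('%038b', 137438953471) . sprintf('%025b', 16777215);
print dump_fields($record), "\n";
```

This prints the 38-bit field, the 25-bit field, and the single pad bit as separate groups, which is easier to eyeball than eight bytes in a hex editor.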

Borodin