4

I have a file that consists of a sequence of real-time Unix timestamps with microsecond accuracy, i.e. the timestamps can never decrease. All the timestamps that need to be coded/decoded are from the same day. A sample entry in the file might be 1364281200.078739, which corresponds to 1364281200078739 µs since the epoch. The data is unevenly spaced and bounded.

I need to achieve a compression of around 10 bits/timestamp. Currently I am able to compress to an average of 31 bits/timestamp by calculating the difference between consecutive timestamps. How can I improve further?
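
For reference, the differencing step looks roughly like this (the file name and the rounding to whole microseconds are illustrative):

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

int main() {
    ifstream ifile ("timestamps.txt");    // one "seconds.microseconds" value per entry
    long double seconds;
    uint64_t timestamp, prev = 0;
    bool first = true;

    while (ifile >> seconds) {
        // long double keeps all ~51 bits needed for microseconds since the epoch
        timestamp = (uint64_t) (seconds * 1000000.0L + 0.5L);
        if (! first) cout << (timestamp - prev) << "\n";    // the delta to be encoded
        prev = timestamp;
        first = false;
    }
    return 0;
}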

Edit:

We are calculating the degree of compression as (size of encoded file in bytes) / (number of timestamps) × 8. I split each timestamp into two parts, before and after the '.'. The integer part is quite constant, and the maximum difference between two integer parts is 32, so I encoded it using 0-8 bits. The precision part is quite random, so I ignored the leading bits and wrote it to the file using 0-21 bits (as it can be at most 999999). I also write a '.' and a space between two timestamps so I can decode later. But the size of my encoded file comes out as 4007674 bytes, and hence the compression as 71.05 bits/TS. How can I improve on the size of my encoded file?
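
(Worked out with the figures above: 4007674 bytes × 8 = 32061392 bits, and 32061392 / 451210 timestamps ≈ 71 bits per timestamp.)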

Here is the link for partial data set - http://pastebin.com/QBs9Bqv0

Here is the link for the differential timestamp values in microseconds - http://pastebin.com/3QJk1NDV The maximum difference between timestamps is 32594136 µs.

learner
  • In the current file I have 451210 entries. We need lossless compression. Here are sample entries in the file - 1364281200.078739 1364281232.672875 1364281232.788200 1364281232.792756 1364281232.793052 1364281232.795598..... – learner Mar 27 '16 at 11:57
  • I think this is one of those questions where you really need to upload a larger sample data set somewhere, maybe an hour's worth of data if the whole thing is too big? – m69's been on strike for years Mar 27 '16 at 20:47
  • I have added the link to dataset in the question. – learner Mar 28 '16 at 05:36

4 Answers

9

If you take the interval between every timestamp and the previous one and express it in microseconds (i.e. as integers), the distribution of values per bit depth in your sample file is:

[figure: interval bit depth distribution]

So 52.285% of the values are either 0 or 1, there are only a handful of other values below 64 (2~6-bits), 27.59% of the values are 7~12-bits, there's a fairly even distribution of around 2.1% per bit up to 20-bits, and only 3% above 20-bits, with a maximum of 25-bits. Looking at the data, it's also obvious that there are many sequences of up to 6 consecutive zeros.
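
A small program along these lines can tabulate that distribution (the file name and the rounding are illustrative):

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

int main() {
    ifstream ifile ("timestamps.txt");
    long double seconds;
    uint64_t timestamp, prev = 0, histogram[64] = {0};
    bool first = true;

    while (ifile >> seconds) {
        timestamp = (uint64_t) (seconds * 1000000.0L + 0.5L);
        if (! first) {
            uint64_t interval = timestamp - prev;
            int bits = 0;
            while (interval >> bits) bits++;    // bit depth (0 for an interval of 0)
            histogram[bits]++;
        }
        prev = timestamp;
        first = false;
    }
    for (int b = 0; b < 64; b++)
        if (histogram[b]) cout << b << " bits: " << histogram[b] << "\n";
    return 0;
}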

These observations gave me the idea of using a variable bit-size per value, something like this:

00 0xxxxx                             0 (xxxxx is the number of consecutive zeros)
00 1xxxxx                             1 (xxxxx is the number of consecutive ones)
01 xxxxxx xxxxxxxx                    2-14 bit values
10 xxxxxx xxxxxxxx xxxxxxxx           15-22 bit values
11 xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx  23-30 bit values

A quick test showed that this resulted in a compression rate of 13.78 bits per timestamp, which isn't quite the 10 bits you were aiming for, but not a bad start for a simple scheme.
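
As a worked example (also spelled out in the comments below): an interval of 81501 µs is 10011111001011101 in binary, a 17-bit value, so it falls in the 15-22 bit class and is stored as 3 bytes: 10000001 00111110 01011101.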


After analysing the sample data some more, I observed that there are a lot of short sequences of consecutive 0's and 1's, like 0 1 0, so I replaced the 1-byte scheme with this one:

00xxxxxx    00     = identifies a one-byte value
            xxxxxx = index in the sequence table

The table of sequences:

index ~ seq    index ~ seq    index ~ seq    index ~ seq    index ~ seq     index ~ seq
  0      0       2     00       6     000     14    0000     30   00000     62   000000
  1      1       3     01       7     001     15    0001     31   00001     63   000001
                 4     10       8     010     16    0010     32   00010    
                 5     11         ...            ...           ...
                               11     101     27    1101     59   11101    
                               12     110     28    1110     60   11110    
                               13     111     29    1111     61   11111    
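
(An index is the run's bits read as a binary number plus 2^len − 2, where len is the run length: the run 010 has len = 3 and bits 010 = 2, giving index 2 + 8 − 2 = 8, as in the table. The encoder below computes exactly this.)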

For the example file with 451,210 timestamps, this brings down the encoded file size to 676,418 bytes, or 11.99 bits per timestamp.

Testing the above method revealed that there were 98,578 single zeros and 31,271 single ones between larger intervals. So I tried using 1 bit of each larger interval to store whether it was followed by a zero, which decreased the encoded size to 592,315 bytes. And when I used 2 bits to store whether larger intervals were followed by 0, 1 or 00 (the most common sequence), the encoded size decreased to 564,034 bytes, or 10.0004 bits per timestamp.
I then changed to storing the single 0's and 1's with the following large interval instead of the preceding one (purely for reasons of code simplicity) and found that this resulted in a file size of 563,884 bytes, or 9.997722 bits per timestamp!

So the complete method is:

Store the first timestamp (8 bytes), then store the intervals as either: 

00 iiiiii                             sequences of up to 5 (or 6) zeros or ones
01 XXxxxx xxxxxxxx                    2-12 bit values (2 ~ 4,095)
10 XXxxxx xxxxxxxx xxxxxxxx           13-20 bit values (4,096 ~ 1,048,575)
11 XXxxxx xxxxxxxx xxxxxxxx xxxxxxxx  21-28 bit values (1,048,576 ~ 268,435,455)

iiiiii = index in sequence table (see above)
XX     = preceded by a zero (if XX=1), a one (if XX=2) or two zeros (if XX=3)
xxx... = 12, 20 or 28 bit value
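
For example, an interval of 81501 µs (a 17-bit value) that was preceded by a single zero is stored as 3 bytes: prefix 10, XX = 01, then 81501 padded to 20 bits, giving 10010001 00111110 01011101.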

Example of an encoder:

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

void write_timestamp(ofstream& ofile, uint64_t timestamp) {    // big-endian
    uint8_t bytes[8];
    for (int i = 7; i >= 0; i--, timestamp >>= 8) bytes[i] = timestamp;
    ofile.write((char*) bytes, 8);
}

int main() {
    ifstream ifile ("timestamps.txt");
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.bin", ios::trunc | ios::binary);
    if (! ofile.is_open()) return 2;

    long double seconds;
    uint64_t timestamp;

    if (ifile >> seconds) {
        timestamp = seconds * 1000000;
        write_timestamp(ofile, timestamp);
    }

    while (! ifile.eof()) {
        uint8_t bytesize = 0, len = 0, seq = 0, bytes[4];   // len/seq track a run of 0/1 intervals
        uint32_t interval;

        while (bytesize == 0 && ifile >> seconds) {
            interval = seconds * 1000000 - timestamp;
            timestamp += interval;

            if (interval < 2) {    // a 0 or 1: append it to the current run
                seq <<= 1; seq |= interval;
                if ((++len == 5 && seq > 0) || len == 6) bytesize = 1;
            } else {
                // larger interval: find the smallest size whose 12/20/28 value bits hold it
                while (interval >> ++bytesize * 8 + 4);
                for (uint8_t i = 0; i <= bytesize; i++) {
                    bytes[i] = interval >> (bytesize - i) * 8;
                }
                bytes[0] |= (bytesize++ << 6);    // 2-bit size prefix: 01, 10 or 11
            }
        }
        if (len) {
            // flush the pending run: a single 0 or 1, or a lone 00, rides along in
            // the XX bits of the larger interval that follows it; anything else
            // gets its own sequence-table byte
            if (bytesize > 1 && (len == 1 || (len == 2 && seq == 0))) {
                bytes[0] |= (2 * len + seq - 1) << 4;
            } else {
                seq += (1 << len) - 2;    // run bits -> sequence-table index
                ofile.write((char*) &seq, 1);
            }
        }
        if (bytesize > 1) ofile.write((char*) bytes, bytesize);
    }
    ifile.close();
    ofile.close();
    return 0;
}

Example of a decoder:

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

uint64_t read_timestamp(ifstream& ifile) {    // big-endian
    uint64_t timestamp = 0;
    uint8_t byte;
    for (uint8_t i = 0; i < 8; i++) {
        ifile.read((char*) &byte, 1);
        if (ifile.fail()) return 0;
        timestamp <<= 8; timestamp |= byte;
    }
    return timestamp;
}

uint8_t read_interval(ifstream& ifile, uint8_t *bytes) {
    uint8_t bytesize = 1;
    ifile.read((char*) bytes, 1);
    if (ifile.fail()) return 0;
    bytesize += bytes[0] >> 6;    // 2-bit size prefix: 0-3 extra bytes follow
    for (uint8_t i = 1; i < bytesize; i++) {
        ifile.read((char*) bytes + i, 1);
        if (ifile.fail()) return 0;
    }
    return bytesize;
}

void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << seconds << "\n";
}

uint8_t write_sequence(ofstream& ofile, uint8_t seq, uint64_t timestamp) {
    uint8_t interval = 0, len = 1, offset = 1;
    while (seq >= (offset <<= 1)) {    // recover run length and bits from the
        seq -= offset;                 // sequence-table index (index = bits + 2^len - 2)
        ++len;
    }
    while (len--) {                    // emit one timestamp per 0/1 in the run
        interval += (seq >> len) & 1;
        write_seconds(ofile, timestamp + interval);
    }
    return interval;                   // total microseconds spanned by the run
}

int main() {
    ifstream ifile ("timestamps.bin", ios::binary);
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.txt", ios::trunc);
    if (! ofile.is_open()) return 2;
    ofile.precision(6); ofile << std::fixed;

    uint64_t timestamp = read_timestamp(ifile);
    if (timestamp) write_seconds(ofile, timestamp);

    while (! ifile.eof()) {
        uint8_t bytes[4], seq = 0, bytesize = read_interval(ifile, bytes);
        uint32_t interval;

        if (bytesize == 1) {
            timestamp += write_sequence(ofile, bytes[0], timestamp);
        }
        else if (bytesize > 1) {
            seq = (bytes[0] >> 4) & 3;    // XX bits: a 0, 1 or 00 preceded this interval
            if (seq) timestamp += write_sequence(ofile, seq - 1, timestamp);
            interval = bytes[0] & 15;     // low 4 value bits of the interval
            for (uint8_t i = 1; i < bytesize; i++) {
                interval <<= 8; interval += bytes[i];
            }
            timestamp += interval;
            write_seconds(ofile, timestamp);
        }
    }
    ifile.close();
    ofile.close();
    return 0;
}
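
A note on running the two programs as given: the encoder reads timestamps.txt and writes output.bin, while the decoder reads timestamps.bin, so rename the encoded file (or change one of the hard-coded paths) between the two runs, then compare the decoder's output.txt against the original input.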

Because of a long double output bug in the MinGW/gcc 4.8.1 compiler I'm using, I had to use the following workaround (this shouldn't be necessary with other compilers):

void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << "1" << (double) (seconds - 1000000000) << "\n";
}

Note to future readers: this method is based on analysis of an example data file; it will not give the same compression rate if your data is different.

  • We are calculating Degree of Compression as (Size of encoded file in bytes)/(Number of timestamps)*8. I divided the timestamps into two parts before '.' and after it. The integer part is quite constant and max difference between two integer part timestamps is 32 so I encoded it using 0-8 bits. The precision part is quite random so I have ignored the leading bits and wrote into file using 0-21 bits(as max it can be 999999). But the size of my encoded file is coming as 4007674 bytes and hence compression as 71.05 bits/TS. How can I improve upon my size of encoded file ? – learner Mar 29 '16 at 01:56
  • I also write '.' and a space between two timestamps to decode later. How can I improve upon my size of encoded file ? Also added this information in question – learner Mar 29 '16 at 02:04
  • I added an example; I hope this makes things clearer. – m69's been on strike for years Mar 29 '16 at 04:25
  • How will the decoding work in this case ? There are cases like 1364331598.975142 1364331599.056643 where precision is less than last precision value. – learner Mar 29 '16 at 05:08
  • @learner The difference is calculated by subtracting the whole values, not just the part after the decimal point. So 1364331599.056643 - 1364331598.975142 = 0.081501 sec, that is 81501 µs, 81501 = 10011111001011101 (17 bits) so it's stored as 3 bytes: 10000001 00111110 01011101. – m69's been on strike for years Mar 29 '16 at 05:39
  • I have compressed an 8 MB file (415210 records) to 1.6 MB, achieving compression of 30.348 bits/timestamp. This is still far away from my target of 10 bits/timestamp. This is my partial encoded file of differences http://pastebin.com/3QJk1NDV. My max difference between two timestamps is 32594136 µs. Can we optimize this further? – learner Mar 29 '16 at 06:56
  • @learner The partial encoded file has an initial timestamp and 21818 integers. The distribution of values is very much like the shorter sample file, with 50.21% of the values being 0 or 1. If you use the method I described in my answer, you get 9882 x 8bit, 6595 x 16bit, 4176 x 24bit, 92 x 32bit and 1 x 64bit, so that is 35976 bytes or 287808 bits in total, or 13.19 bits per timestamp. – m69's been on strike for years Mar 29 '16 at 16:30
  • I had a sample implementation here http://ideone.com/Dzn5Qt; from this I am getting a 1,575,103-byte encoded file from an 8,121,780-byte timestamp file. The number of records is 415210. I am getting the number of bytes depending on the difference of timestamps (your first approach). I am not sure why I am getting a compression ratio of 30.34. I have uploaded my complete dataset here https://www.dropbox.com/s/w98c2m8qqxx1cnp/timestamps.txt?dl=0 and the encoded file here https://www.dropbox.com/s/nrh7vkzer3plx51/encode.txt?dl=0 – learner Mar 30 '16 at 01:35
  • @learner You seem to be writing to a text file, judging by the presence of `fprintf()`. My suggestion was to write each difference value as 1, 2, 3 or 4 bytes to a binary file. That's why we're getting completely different results. – m69's been on strike for years Mar 30 '16 at 02:51
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/107676/discussion-between-learner-and-m69). – learner Mar 30 '16 at 03:00
  • @learner I decided to try and write code for this, to refamiliarise myself with C++. I just posted a first working version. Don't worry if you no longer need it, I'm doing it as an exercise. – m69's been on strike for years Apr 04 '16 at 04:59
1

If you need lossless compression at microsecond accuracy, then note that 10 bits will let you count to 1024.

If the timing of your events is random, and you actually need the microsecond accuracy that you spec'ed, this means your differential timestamps cannot range over much more than about 1 millisecond without exceeding your 10-bit/event budget.

Based on a quick glance at your data, you probably can't quite make 10 bits/timestamp. However, your differencing is the right first step, and you can do better than 31 bits -- I would do statistics on a sample dataset, and choose a binary prefix encoding that reflects that distribution.

You should make sure your code has room to encode large gaps if necessary, so consider basing it on a universal code.
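
For example, the Elias gamma code is one simple universal code. A minimal sketch (bits emitted as '0'/'1' characters for clarity; a real encoder would pack them into bytes):

#include <stdint.h>
#include <iostream>
#include <string>
using namespace std;

// Elias gamma code for n >= 1: (bit length - 1) zero bits, then n in binary.
// A delta of 0 can be handled by encoding delta + 1 instead.
string elias_gamma(uint64_t n) {
    int bits = 0;
    for (uint64_t v = n; v; v >>= 1) bits++;    // bit length of n
    string out(bits - 1, '0');                  // unary length prefix
    for (int i = bits - 1; i >= 0; i--)         // n itself, MSB first
        out += ((n >> i) & 1) ? '1' : '0';
    return out;
}

int main() {
    uint64_t samples[] = {1, 2, 5, 81501};
    for (int i = 0; i < 4; i++)
        cout << samples[i] << " -> " << elias_gamma(samples[i]) << "\n";
    return 0;
}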

comingstorm
0

It's hard to know without seeing a histogram of the differences for your data. I would try a Rice code to encode the differences, selecting the parameter to get the best compression for your distribution of differences.
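
A minimal sketch of Rice coding along those lines (rice_encode and the choice k = 7 are illustrative; bits are again written as characters rather than packed):

#include <stdint.h>
#include <iostream>
#include <string>
using namespace std;

// Rice code with parameter k: the quotient n >> k in unary ('1's ended by a '0'),
// then the low k bits of n. Works well when deltas cluster near 2^k; rare huge
// deltas produce long unary runs, so an escape code may be worth adding.
string rice_encode(uint64_t n, int k) {
    string out(n >> k, '1');            // unary quotient
    out += '0';                         // terminator
    for (int i = k - 1; i >= 0; i--)    // k-bit remainder
        out += ((n >> i) & 1) ? '1' : '0';
    return out;
}

int main() {
    int k = 7;                          // tune k to the measured delta distribution
    uint64_t samples[] = {0, 1, 100, 513};
    for (int i = 0; i < 4; i++)
        cout << samples[i] << " -> " << rice_encode(samples[i], k) << "\n";
    return 0;
}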

Mark Adler
0

Compressing integers (especially sorted integers) is a well-studied research topic. You may want to use this project.

youkaichao