
My goal is to store hundreds of individual files as efficiently as possible and to read them back using Java 1.6. The files contain 125,000 numbers on average; some contain only a few hundred, others more than 7,000,000. In most cases the numbers fall in the range 0 to 255 and can be stored in 1 byte; in some cases they fall in the range 0 to 1024 and need 2 bytes.

To save the data I use the BZip2 implementation from Apache. But BZip2 can only store values that are no more than 1 byte in size. That's why I wrote a class that splits a sequence of integers into bits and combines 8 bits into 1 byte. These bytes are then written to the CBZip2OutputStream (the BZip2 OutputStream). The combination of both algorithms works quite well. Unfortunately, my algorithm is very slow at reading. The table below shows the time in milliseconds it took to read files with 125,000 numbers.

| Gzip | BZip2 | UTF-8 | my algorithm |
| ---- | ----- | ----- | ------------ |
| 47   | 28    | 35    | 1008         |
| 37   | 12    | 13    | 856          |
| 25   | 11    | 10    | 845          |
| 25   | 12    | 5     | 862          |

On average, my algorithm is about 56 times slower than BZip2.
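For reference, the kind of bit packing described in the question might look roughly like this. This is a sketch, not the original code: the class and method names are illustrative, and 10 bits per value is assumed (enough for the 0–1023 range mentioned above).

```java
/**
 * Illustrative sketch of fixed-width bit packing: each value occupies
 * exactly `bits` bits, and consecutive values are packed back to back
 * into a byte array, most significant bit first.
 */
public class BitPacker {

    /** Packs each value into `bits` bits. */
    public static byte[] pack(int[] values, int bits) {
        byte[] out = new byte[(values.length * bits + 7) / 8];
        int bitPos = 0;
        for (int v : values) {
            for (int b = bits - 1; b >= 0; b--) {
                if (((v >> b) & 1) != 0) {
                    // set bit `bitPos` in the output array
                    out[bitPos >> 3] |= (byte) (1 << (7 - (bitPos & 7)));
                }
                bitPos++;
            }
        }
        return out;
    }

    /** Reverses pack(): reads `count` values of `bits` bits each. */
    public static int[] unpack(byte[] data, int count, int bits) {
        int[] out = new int[count];
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int v = 0;
            for (int b = 0; b < bits; b++) {
                // shift in the next bit from the byte array
                v = (v << 1) | ((data[bitPos >> 3] >> (7 - (bitPos & 7))) & 1);
                bitPos++;
            }
            out[i] = v;
        }
        return out;
    }
}
```

The inner per-bit loops are what makes this style of code slow to read back: every single bit costs a shift, a mask, and an OR, so 125,000 ten-bit values mean 1.25 million loop iterations.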

Is there another way to compress numbers consisting of more than 8 bits efficiently? Reading speed matters most: the read time should be at most 2 to 4 times that of BZip2, with similarly good compression. If there is no other way, I will post my source code and explain what needs to be optimized.

  • It's not clear to me why you think a different compression algorithm is necessary. Each of your numbers is broken down into bytes which can then be compressed by any compression algorithm you like, including BZip2. If many of the bytes are zeroes, then compression algorithms will take that into account and compress accordingly. You're not likely to beat a popular compression algorithm. – Louis Wasserman Dec 21 '15 at 21:22
  • What are the compression ratios? Things like bzip/gzip may be delivering decent compression. – mksteve Dec 21 '15 at 21:26
  • With BZip2 the data is reduced to 40%. @mksteve – Marc Schmidt Dec 21 '15 at 21:41
  • I want to use a different compression algorithm because mine needs too much time to read data. @Louis Wasserman – Marc Schmidt Dec 21 '15 at 21:42
  • @MaggiCraft Why do you think it's possible to do better? – Louis Wasserman Dec 21 '15 at 21:42
  • Why should it take so much time to split integers into bits, combine them into bytes, and then do the same in reverse when reading? @Louis Wasserman – Marc Schmidt Dec 21 '15 at 21:53
  • It's not clear to me why you think that's a lot of time. That said: how did you write the numbers to the files before compression? In text? In some other format? That would probably make some difference. – Louis Wasserman Dec 21 '15 at 21:55
  • Reading from a BZip2 InputStream takes only 12 milliseconds, while reading with my algorithm takes 856 milliseconds, which is very long. The numbers are written to the CBZip2OutputStream, which in turn uses an OutputStream. @Louis Wasserman – Marc Schmidt Dec 21 '15 at 22:01
  • How do you sent the numbers to the `OutputStream`? And for what it's worth, your algorithm doesn't actually sound like it does any compression at all. Dividing integers into bits and putting them together into bytes does no compression at all compared to just writing the integers as bytes in the first place. – Louis Wasserman Dec 21 '15 at 22:21
  • The data is sent via the CBZip2OutputStream to the java.io.OutputStream. But that's not the problem, because data written to the CBZip2OutputStream is read back very quickly with CBZip2InputStream. My algorithm is used because only numbers in the range 0-255 can be written with bzip2 or gzip. When reading with my algorithm, the bytes are split into bits which are recombined into integers. That is what takes so much time. @Louis Wasserman – Marc Schmidt Dec 21 '15 at 22:38
  • Ah, I follow. It sounds like you should be wrapping your `OutputStream` and `InputStream`s with `DataOutputStream` and `DataInputStream`, which will take care of encoding arbitrary sized numbers as bytes. You can do `new DataOutputStream(new CBZip2OutputStream(destinationOutputStream))` to do compression on the `DataOutputStream` results. – Louis Wasserman Dec 21 '15 at 22:39
  • `CBZip2OutputStream out = new CBZip2OutputStream(new DataOutputStream(new FileOutputStream(PATH))); out.write(1000); out.flush(); out.close();` writes: 232. Unfortunately this does not work. But thanks for the idea. I'm going to look for solutions in this area. @Louis Wasserman – Marc Schmidt Dec 21 '15 at 23:09
  • You want the other way around. `DataOutputStream out = new DataOutputStream(new CBZip2OutputStream(new FileOutputStream(PATH)))`. – Louis Wasserman Dec 21 '15 at 23:11
  • No, I have to convert the integers to bytes first and write the bytes to the BZip2 stream. @Louis Wasserman – Marc Schmidt Dec 21 '15 at 23:32
  • @MaggiCraft Converting integers to bytes is part of what `DataOutputStream` _does_. The code I gave you will convert integers to bytes and then pass them through to the BZip2 stream. You can then reverse the transformation on the other end using `DataInputStream`. – Louis Wasserman Dec 21 '15 at 23:35
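The stream wiring suggested in the comments could be sketched as below. Since `CBZip2OutputStream` is not part of the JDK, `GZIPOutputStream`/`GZIPInputStream` from `java.util.zip` stand in here; only the compression stream class would change. The class and method names are illustrative, not anything from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamDemo {

    /**
     * DataOutputStream handles splitting each value into bytes;
     * the compressing stream sits underneath it.
     */
    public static byte[] writeShorts(int[] values) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buf));
        for (int v : values) {
            out.writeShort(v); // 2 bytes per value, enough for 0-1023
        }
        out.close(); // flushes and finishes the compressed stream
        return buf.toByteArray();
    }

    /** Reverses writeShorts(): decompress, then read back the 2-byte values. */
    public static int[] readShorts(byte[] data, int count) throws IOException {
        DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)));
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = in.readUnsignedShort();
        }
        in.close();
        return out;
    }
}
```

Note the ordering: the `DataOutputStream` wraps the compressing stream, not the other way around, so the multi-byte encoding happens before compression.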

1 Answer


It sounds like your encoding scheme is very inefficient. Try using a library that does the conversion for you. See for example protocol-buffers, but any other serialization library will do just fine.

Failing that, make sure you are using bit operations so the code is as fast as possible. Something like:

byte[] out = new byte[2];
int x = 1000;                       // example value needing 2 bytes
out[0] = (byte) (x & 0xff);         // low byte
out[1] = (byte) (x >> 8);           // high byte

int y = (out[0] & 0xff) | ((out[1] & 0xff) << 8);  // reconstructs x

(The `& 0xff` on the way back matters because `byte` is signed in Java.)

The slowness might also be because you request very small amounts of data very often. Try using a reasonably large buffer for your reads.
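The buffering advice might look like this in practice. `BufferedReadDemo` and the 64 KB buffer size are illustrative choices, not anything from the answer:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BufferedReadDemo {

    /**
     * Wrapping the FileInputStream in a BufferedInputStream means the
     * DataInputStream (or a decompressor) pulls from an in-memory buffer
     * instead of issuing many tiny reads against the file.
     */
    public static int[] readInts(File file, int count) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file), 64 * 1024));
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = in.readUnsignedShort();
        }
        in.close();
        return out;
    }

    /** Counterpart writer, buffered for the same reason. */
    public static void writeInts(File file, int[] values) throws IOException {
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)));
        for (int v : values) {
            out.writeShort(v);
        }
        out.close();
    }
}
```

The same wrapping works with a compressed stream in the middle: the buffer goes directly around the `FileInputStream` so the decompressor reads large chunks.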

Sorin
  • So far I have not found any library that converts the integers into bits and bytes and converts everything back. I'll try to improve my algorithm with bit operations. I will also experiment with the buffer size. I did not understand what protocol-buffers are or how they can help me store data. @Sorin – Marc Schmidt Dec 21 '15 at 22:44
  • protocol-buffers are a bunch of generic data routines, which include serialization and deserialization among many others. They are optimized for speed. In your case you can define a protocol buffer with a single repeated int field (that is your array) and let the protocol buffer code convert to and from a stream (or byte array). The encoded data will take a bit more space but I think it will compress just as well. – Sorin Dec 22 '15 at 09:21