4

I have a file in which each row holds 25,000 comma-delimited floats, and there are about 100K such rows. A row of the file looks something like:

1689.97,-9643.39,-82082.1,9776.09,-33974.84,-67247.38,32997.34,72811.53,31642.87,-949.6,9340.68,-85854.48,-17705.36,187.74,-3002.6,-35812.21,37382.32,22770.78,40893.09,45743.99,-6500.92,26243.85,13975.95,0,56669.47,-25865.36,-17066.78,26788.57,0,-36554.86,-3687.19,18933.93

I have a 2 part question.

  1. Is there a way (in Java or Python) to compress the data efficiently without affecting performance much? The compression would be done once per day, but the data has to be read quite often.
  2. Can the data be manipulated in its compressed form? For example, I would like to aggregate the first 10 columns of the first 10 rows without decompressing. That way I don't have to worry about frequent reads of compressed data. One of the challenges would be converting 25,000 strings to floats for addition.

I have looked at gzip and zcat, and they are good options. But I wanted to find a compression or serialization algorithm that lets me store the data through Java/Python and perform reads without decompressing.

Wayne Koorts
Ashu
  • Lookie: http://stackoverflow.com/questions/87679/advice-on-handling-large-data-volumes – David Feb 07 '13 at 19:41
  • Must the file be ASCII, or could you use a binary file instead? Are the floats single or double precision? If they are single precision, then probably the easiest thing is to store the binary representation of the floats in the file. – Bakuriu Feb 07 '13 at 19:44
  • To whoever reverted the changes to the question: 1) tags do not belong in the title, so "in java" should *not* be mentioned there. There is a tag for that (and the OP is using it). Also, the formatting I gave is correct. The OP wanted an enumerated list and now has one, and that huge line must definitely be displayed as code. – Bakuriu Feb 07 '13 at 19:53
  • @David: thanks. I'll take a look at the mapped byte buffers. – Ashu Feb 07 '13 at 20:09
  • @Bakuriu: I have thought of storing them as binary. The only problem is that when I have to read, I'd have to deserialize. I was hoping there is a way to read it as binary, do manipulations, and then convert it back to ASCII. – Ashu Feb 07 '13 at 20:32

3 Answers

3

In Java, you can wrap your OutputStream with a GZIPOutputStream and your InputStream with a GZIPInputStream to compress/decompress your data on the fly using the GZIP algorithm.
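
A minimal sketch of that wrapping, assuming the rows stay as comma-delimited ASCII text. The class name `GzipRows` and both method names are hypothetical, chosen only for illustration. Note that the data is still decompressed while it is read; it just happens in the stream rather than as a separate step.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: stores comma-delimited rows in a gzip file.
class GzipRows {

    // Writes one text row per line through a GZIPOutputStream.
    static void writeRows(Path file, String[] rows) throws IOException {
        try (BufferedWriter w = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)),
                StandardCharsets.US_ASCII))) {
            for (String row : rows) {
                w.write(row);
                w.newLine();
            }
        }
    }

    // Sums the first colCount values of the first rowCount rows,
    // decompressing on the fly and stopping as soon as enough rows are read.
    static double sumFirstColumns(Path file, int rowCount, int colCount)
            throws IOException {
        double sum = 0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)),
                StandardCharsets.US_ASCII))) {
            String line;
            for (int i = 0; i < rowCount && (line = r.readLine()) != null; i++) {
                // The limit keeps split() from cutting up the rest of a long row.
                String[] cells = line.split(",", colCount + 1);
                for (int c = 0; c < colCount && c < cells.length; c++) {
                    sum += Double.parseDouble(cells[c]);
                }
            }
        }
        return sum;
    }
}
```

Because gzip is a sequential format, reading rows near the start of the file is cheap, but reaching a row near the end means decompressing everything before it.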

beny23
0

Use DataOutputStream and writeFloat; then you don't need a comma separator.
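
A rough sketch of this approach, assuming single precision is acceptable (a 4-byte float keeps only about 7 significant digits) and that every row has the same number of values. The class `FloatRows` and its methods are hypothetical names. Since each value then occupies exactly 4 bytes, a reader can seek straight to any row and aggregate a block without parsing text.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: fixed-width binary rows of floats.
class FloatRows {

    // Writes every row as consecutive 4-byte big-endian floats, no separators.
    static void write(String path, float[][] rows) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            for (float[] row : rows) {
                for (float v : row) {
                    out.writeFloat(v);
                }
            }
        }
    }

    // Sums the first colCount values of the first rowCount rows.
    // rowWidth is the number of floats per row (25,000 in the question).
    static double sumBlock(String path, int rowWidth, int rowCount, int colCount)
            throws IOException {
        double sum = 0;
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            for (int r = 0; r < rowCount; r++) {
                // Jump to the start of row r; no earlier data is read.
                in.seek((long) r * rowWidth * Float.BYTES);
                for (int c = 0; c < colCount; c++) {
                    sum += in.readFloat();
                }
            }
        }
        return sum;
    }
}
```

For the question's example, `FloatRows.sumBlock(path, 25000, 10, 10)` would read only 100 values. `RandomAccessFile` is unbuffered, so for larger blocks a memory-mapped buffer may be a better fit.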

Edgard Leal
  • This does not imply that the resulting file will be smaller. Floats can take up to 8 bytes to be represented; if the ASCII representation is smaller, the file size might increase, or it might be reduced only by a small factor. – Bakuriu Feb 07 '13 at 19:54
  • `DataOutputStream out =
    new DataOutputStream(new FileOutputStream("out.dat"));`
    `// out.writeFloat(0F); // 4 bytes`
    `// out.writeChars("0,"); // 4 bytes`
    `out.close();`
    At worst, it is the same size.
    – Edgard Leal Feb 07 '13 at 20:15
  • No, `0,` is *two* bytes, since it's ASCII. Also, if they are doubles then `12345.67` takes 8 bytes in ASCII, which is the same as its binary representation. There is quite a high probability that the size would decrease, but it depends on the representation of the floats in ASCII. Also, gzipping the ASCII file reduces its size by half, while a binary file would probably be compressed by a smaller amount (which, again, doesn't guarantee that the binary representation will be smaller in the end). – Bakuriu Feb 08 '13 at 06:40
0

Instead of writing it out as text, you could write it out as bytes. You would have to convert primitives to and from byte arrays, but I don't think that would be too hard. You can use Float.floatToRawIntBits() to convert a float to an int and Float.intBitsToFloat() to go back from the int. Converting an int to a byte[] is just a matter of a couple of bit shifts.
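
The conversion could be sketched as follows. The class `FloatBytes` and its method names are hypothetical, and big-endian byte order is an assumption made here (it matches what `DataOutputStream.writeFloat` produces).

```java
// Hypothetical helper: float <-> 4-byte array using bit shifts.
class FloatBytes {

    // Splits the float's raw IEEE 754 bits into 4 bytes, most significant first.
    static byte[] toBytes(float value) {
        int bits = Float.floatToRawIntBits(value);
        return new byte[] {
            (byte) (bits >>> 24),
            (byte) (bits >>> 16),
            (byte) (bits >>> 8),
            (byte) bits
        };
    }

    // Reassembles the int from 4 bytes and reinterprets it as a float.
    // The & 0xFF masks undo Java's sign extension of byte values.
    static float fromBytes(byte[] b) {
        int bits = ((b[0] & 0xFF) << 24)
                 | ((b[1] & 0xFF) << 16)
                 | ((b[2] & 0xFF) << 8)
                 | (b[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }
}
```

`java.nio.ByteBuffer` offers the same conversion without manual shifts, for example `ByteBuffer.allocate(4).putFloat(value).array()`.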

CodeChimp