
The application keeps receiving Report objects and puts them into a Disruptor for three different consumers.

According to Eclipse Memory Analyzer, the retained heap size of each Report object is 20 KB on average. The application starts with -Xmx2048m, i.e. a 2 GB heap.

However, there are around 100,000 such objects at a time, which means the total size of all the objects is roughly 2 GB.

The requirement is that all 100,000 objects should be loaded into the Disruptor so that the consumers can consume the data asynchronously. But that isn't possible if each object is as large as 20 KB.

So I'd like to serialize each object and compress the result into a String:

private static byte[] toBytes(Serializable o) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(baos);
    oos.writeObject(o);
    oos.close();

    return baos.toByteArray();
}

private static String compress(byte[] str) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(out);
    gzip.write(str);
    gzip.close();
    return new String(Base64Coder.encode(out.toByteArray()));
}

After compress(toBytes(report)), the stored form is much smaller:

[Screenshot: size before compression]

[Screenshot: size after compression]

The String form of each object is now around 6 KB, which is better.

Here are my questions:

  1. Is there any other data format that takes up less space than a String?

  2. Serializing and compressing each object creates new objects like ByteArrayOutputStream and ObjectOutputStream every time. I don't want to create so many of them, because I need to iterate 100,000 times. How can I design the code so that objects like ByteArrayOutputStream and ObjectOutputStream are created once and reused on each iteration?

  3. Consumers need to decompress and deserialize the String from the Disruptor. With three consumers I would have to decompress and deserialize three times. Is there any way around this?
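
For question 3, one idea I'm considering (just a sketch with the LMAX Disruptor 3.x DSL; ReportEvent, decompressObj and the consumer names are placeholders, not my real code) is to let a single handler decompress and deserialize each entry once, store the result back on the event, and chain the three real consumers after it so they all read the already-decoded object:

import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

// Event slot: carries the compressed bytes in and the decoded Report out.
class ReportEvent {
    byte[] compressed;
    Report report;
}

// Decodes once per event; decompressObj(byte[]) stands for whatever undoes the compression.
EventHandler<ReportEvent> decoder = (event, sequence, endOfBatch) -> {
    event.report = (Report) decompressObj(event.compressed);
    event.compressed = null; // drop the bytes once decoded
};

Disruptor<ReportEvent> disruptor =
        new Disruptor<>(ReportEvent::new, 1024, DaemonThreadFactory.INSTANCE);

// The decoder runs first, then the three consumers each read event.report directly.
disruptor.handleEventsWith(decoder).then(consumerA, consumerB, consumerC);
disruptor.start();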


Update:

As @BoristheSpider suggested, the serialization and compression should be performed in one action:

private static byte[] compressObj(Serializable o) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    GZIPOutputStream zos = new GZIPOutputStream(bos);
    ObjectOutputStream ous = new ObjectOutputStream(zos);

    ous.writeObject(o);
    ous.flush();   // push the buffered object data into the gzip stream
    zos.finish();  // flush the remaining gzip blocks into bos

    return bos.toByteArray();
}
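
The consumer side would presumably need the mirror image of this (a sketch, untested): wrap the byte[] in a GZIPInputStream and read the object back with an ObjectInputStream:

private static Object decompressObj(byte[] bytes) throws IOException, ClassNotFoundException {
    ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
    GZIPInputStream zis = new GZIPInputStream(bis);
    ObjectInputStream ois = new ObjectInputStream(zis);
    return ois.readObject();   // the caller casts back to Report
}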
macemers
  • This probably isn't the way to go. It will have a huge performance impact. There are design patterns specifically for cases like this, like the [Flyweight pattern](http://en.wikipedia.org/wiki/Flyweight_pattern). On a side note - why don't you compress the stream directly? Why do you first create `byte[]` then compress it? – Boris the Spider Mar 25 '14 at 08:10
  • The app receives a custom object, `Report`. What I'm doing is serializing `Report` and compressing the serialized String. Are you suggesting compressing `Report` directly? – macemers Mar 25 '14 at 08:14
  • I suggest wrapping the `ByteArrayOutputStream` in a `GZIPOutputStream` _then an_ `ObjectOutputStream`. This will serialize and compress in one action. – Boris the Spider Mar 25 '14 at 08:16
  • “*Is there any other data format whose size is less than String*” — Of course, there is. The byte array you had before creating the base64 encoded `String` is far more compact. Why don’t you keep the byte array rather than creating a `String`? At some time you might want to decompress a report (otherwise you don’t need to store it at all) and at that time you will need a byte array again. – Holger Mar 25 '14 at 08:28
  • @BoristheSpider I've added an update, please refer to it. – macemers Mar 25 '14 at 08:43
  • @Holger you're right, `byte[]` is much smaller. In my case the object is 20 KB and the String is 6 KB, but the `byte[]` is around 2 KB. – macemers Mar 25 '14 at 08:49

1 Answer


Using ObjectOutputStream and compression is so much more expensive than using the Disruptor that it defeats the purpose of using it. It is likely to be 1000x more expensive.

You are far better off limiting how many objects you queue at once. Unless something is seriously wrong with your design, a queue of just 1,000 of your 20 KB objects should be more than enough to keep all your consumers working efficiently.
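
For illustration only (a sketch; ReportEvent and the handler names are placeholders, and the ring size is just an example), a Disruptor with a 1024-slot ring buffer holding plain Report references bounds memory for free: when the buffer is full the publisher simply waits, so you never hold 100,000 live objects at once and you can skip serialization entirely:

import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

// Pre-allocated slots: 1024 * ~20 KB is only about 20 MB of live Reports at any moment.
class ReportEvent {
    Report report;
}

Disruptor<ReportEvent> disruptor =
        new Disruptor<>(ReportEvent::new, 1024, DaemonThreadFactory.INSTANCE);
disruptor.handleEventsWith(consumerA, consumerB, consumerC);
disruptor.start();

// publishEvent waits when the ring buffer is full, giving natural back-pressure.
disruptor.getRingBuffer().publishEvent((event, seq, r) -> event.report = r, incomingReport);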

BTW, if you need persistence, I would use Chronicle (partly because I wrote it). It doesn't need compression, byte[] or Strings for storage, it persists all messages, your queue is unbounded, and it lives entirely off heap, i.e. your 100K objects will use << 1 MB of heap.
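
As a very rough illustration (a sketch only, using Chronicle Queue's ChronicleQueue/ExcerptAppender/ExcerptTailer API; treat the exact calls and the queue path as illustrative rather than a recipe), the producer appends each report to a memory-mapped queue on disk and the slow database consumer tails it at its own pace:

import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;

try (ChronicleQueue queue = ChronicleQueue.singleBuilder("reports-queue").build()) {
    // Producer side: entries land in memory-mapped files on disk, not the Java heap.
    ExcerptAppender appender = queue.acquireAppender();
    appender.writeText(encodedReport);   // encodedReport is a placeholder for however you encode a Report

    // Consumer side (e.g. the lagging database writer), typically another thread or process.
    ExcerptTailer tailer = queue.createTailer();
    String next = tailer.readText();     // null when there is nothing new to read
}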

Peter Lawrey
  • Thx for your answer Peter. My concern is that one of the consumers, the database, doesn't keep up and lags behind, so the `Disruptor` has to wait for it before it can process further. – macemers Mar 25 '14 at 09:00
  • @user838204 In that case I would spool it to a Chronicle which will persist the data as you go and you don't have to worry if the database is an hour or a day behind. If you need redundancy, Chronicle supports TCP replication. – Peter Lawrey Mar 25 '14 at 09:57
  • @user838204 or just use a message broker like ActiveMQ. It can be configured to persist messages etc. – Boris the Spider Mar 25 '14 at 10:41
  • @PeterLawrey why would my 100k objects of 20 KB each use << 1 MB of heap? – macemers Mar 26 '14 at 16:51
  • @user838204 all the keys and values are stored off heap, so only a small number of objects are needed on heap no matter the size of the entries. – Peter Lawrey Mar 26 '14 at 16:59
  • But the whole server only has 8 GB of memory, 2 GB of which is Java's heap, and there are other apps running on the same server. Is that enough for Chronicle? BTW, does Chronicle support one thread persisting messages while another reads from it? – macemers Apr 04 '14 at 02:54
  • @user838204 Each Chronicle needs about 100 KB of main memory, and I don't suggest you have too many of them. The more memory you have free, the more it can use, but this is handled transparently by the OS and is not something you should have to worry about. It supports multiple readers and writers in different *processes* on the same machine. It also supports TCP replication. – Peter Lawrey Apr 04 '14 at 15:29
  • @user838204 Something to consider is that Chronicle is designed to record all the data until you delete it externally, e.g. once per day. This is useful for replaying long sequences of real data in testing or for reproducing a bug. However, it assumes disk space is very cheap, which is not true in all organizations. – Peter Lawrey Apr 04 '14 at 15:31
  • @user838204 A SharedQueue, which is under development, will assume consumption of a message once it is read. This will use less disk space, but you have no record. – Peter Lawrey Apr 04 '14 at 15:33
  • Thx for your detailed explanation; I've forked Chronicle and started looking into it. Could I contribute to Chronicle as well? I really want to learn how to write high-performance applications in Java. BTW, my problem is described separately here: http://stackoverflow.com/questions/22859618/whats-the-best-way-to-asynchronously-handle-low-speed-consumer-database-in-hi?noredirect=1#comment34876833_22859618 Please take a look and see if Chronicle helps. – macemers Apr 04 '14 at 16:20
  • @user838204 You can issue pull requests for chronicle. Make sure you fork the OpenHFT version. Chronicle is GC-less, allows queuing up to your free disk space, and frees your producer from your consumer. i.e. it doesn't matter if your consumer is slow, or even running. – Peter Lawrey Apr 07 '14 at 11:46
  • Peter, just to make sure: does `Chronicle` support the following: being large enough to hold all the messages so that it won't slow down the `Disruptor`, while letting the database consumer consume messages from `Chronicle`, just like a blocking queue? – macemers May 23 '14 at 08:05
  • @user838204 Whether you have enough space depends on how much free disk space you have. If you have 1 TB free, you can queue up to 1 TB. – Peter Lawrey May 23 '14 at 08:10
  • @PeterLawrey I understand the size now. So `Chronicle` can serve as something very similar to a BlockingQueue, allowing the producer to put data while the consumer gets data. And it's based on memory-mapped files rather than memory only, which makes it GC-less and high-performance. Is my understanding correct? – macemers May 23 '14 at 08:47
  • @user838204 yes, and instead of being limited by the size of the queue, or your memory, it is limited by the free disk space you have, which is assumed to be much larger. – Peter Lawrey May 23 '14 at 08:49