
I'm under the impression that a ByteArrayOutputStream is not memory efficient, since all of its contents are held in memory.

Similarly, calling toByteArray on a large stream copies the entire internal buffer, which seems like it would scale poorly.

Why, then, does the example in Tom White's book *Hadoop: The Definitive Guide* use them both:

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Decoder decoder = DecoderFactory.defaultFactory().createBinaryDecoder(out.toByteArray(), null);

Isn't "Big Data" the norm for Avro? What am I missing?

Edit 1: What I'm trying to do - say I'm streaming Avro records over a websocket. What would the example look like if I wanted to deserialize multiple records, not just one that was put in its own ByteArrayOutputStream?

Is there a better way to supply BinaryDecoder with a byte[]? Or perhaps a different type of stream? Or should I be sending one record per stream instead of loading streams with multiple records?

Julian Peeters
  • Your question would be easier to answer if you were specific about what you plan to do with Avro. – hack_on Feb 25 '13 at 08:48
  • The long story is that I'm extending [Salat-Avro](https://github.com/Banno/salat-avro) to support serializing Scala case classes to/from Avro datafiles. I'm trying to achieve consistency between the methods for both datafiles and in-memory serialization. For large datafiles, I can deserialize the avros efficiently because a DataFileReader is an **iterator** over the records and doesn't keep evaluations in memory (see the sketch after these comments). In contrast to datafiles, in-memory deserialization is accomplished not with an iterator over a stream, but by repeatedly evaluating a function against its data source. – Julian Peeters Feb 25 '13 at 22:40
  • The longer story is that the "repeated calling of a function" is no problem for small numbers of records, because I can generate a `Stream` by `cons`ing the result of the function. But as the number of records grows, a `Stream` becomes impractical due to memory usage. Sadly, `Iterator` has no analogue of `cons`, and I was going to figure out what to do about that, when I noticed the canonical example in the question might not support large numbers of records anyway. – Julian Peeters Feb 25 '13 at 22:59
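
For reference, the iterator-style datafile reading described in the comments looks roughly like this, a minimal sketch in which the file name is a placeholder and the writer schema is read from the datafile's own header:

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class DataFileIteration {
        public static void main(String[] args) throws IOException {
            // The writer schema is read from the datafile's header
            GenericDatumReader<GenericRecord> datumReader =
                new GenericDatumReader<GenericRecord>();
            DataFileReader<GenericRecord> fileReader =
                new DataFileReader<GenericRecord>(new File("records.avro"), datumReader);
            try {
                GenericRecord record = null;
                while (fileReader.hasNext()) {
                    record = fileReader.next(record); // reuse: one record in memory at a time
                    // process(record);
                }
            } finally {
                fileReader.close();
            }
        }
    }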

1 Answer


ByteArrayOutputStream makes sense when dealing with small objects like small-to-medium images or fixed-size request/response payloads. It lives entirely in memory and never touches the disk, which can be great for performance. It makes no sense to use it for a terabyte of data. This is likely a case of keeping a book example small and self-contained so as not to detract from the main point.
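
For scale, here is a minimal, self-contained sketch of the in-memory case the book is illustrating: one small record encoded into a ByteArrayOutputStream and decoded straight back from the resulting byte[]. The Pair schema is made up for the example, and it assumes the Avro 1.5+ factory API (the book's snippet uses the older DecoderFactory.defaultFactory() style):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class InMemoryRoundTrip {
        public static void main(String[] args) throws IOException {
            // A made-up two-field schema, standing in for the book's example
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":["
                + "{\"name\":\"left\",\"type\":\"string\"},"
                + "{\"name\":\"right\",\"type\":\"string\"}]}");

            GenericRecord pair = new GenericData.Record(schema);
            pair.put("left", "L");
            pair.put("right", "R");

            // Everything below stays in memory -- fine for one small record
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(pair, encoder);
            encoder.flush();

            // toByteArray() copies the whole buffer -- the part that scales poorly
            BinaryDecoder decoder =
                DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            GenericRecord roundTripped =
                new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
            System.out.println(roundTripped);
        }
    }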


EDIT: Now that I see where you're going, I'd look to set up a pipeline. Pull a message off the stream (so I'm assuming you can get an InputStream from your HTTP object) and either process it with a memory-less method, or throw it on a queue and have a thread pool work the queue with a memory-less method. So the requirements are 1) being able to detect the boundary between Avro messages as you pull them off the stream, and 2) having a method for decoding each message.

The way to decode appears to be to read the bytes for each message into a byte array and hand that to your BinaryDecoder (after you find the message boundary), as in the sketch below.
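
A minimal sketch of that approach, assuming the sender length-prefixes each Avro message with a 4-byte int (the framing scheme is an assumption, as are the class and method names; any boundary marker both sides agree on works):

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    public class FramedAvroReader {
        public static void process(InputStream in, Schema schema) throws IOException {
            DataInputStream frames = new DataInputStream(in);
            GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<GenericRecord>(schema);
            BinaryDecoder decoder = null; // reused across messages
            GenericRecord record = null;  // reused across messages

            while (true) {
                int length;
                try {
                    length = frames.readInt(); // 4-byte length prefix marks the boundary
                } catch (EOFException e) {
                    break;                     // clean end of stream
                }
                byte[] message = new byte[length];
                frames.readFully(message);     // only one message's bytes in memory
                decoder = DecoderFactory.get().binaryDecoder(message, decoder);
                record = reader.read(record, decoder);
                // hand `record` to the next pipeline stage or worker queue here
            }
        }
    }

This keeps memory flat regardless of how many records cross the websocket: each iteration holds one message's bytes and one (reused) record, never the whole stream.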

hack_on