31

We need to serialize some data to put into Solr as well as Hadoop.

I am evaluating serialization tools for this purpose.

The top two on my list are Gson and Avro.

As far as I understand, Avro = Gson + Schema-In-JSON

If that is correct, why is Avro so popular for Solr/Hadoop?

I have searched a lot on the Internet, but cannot find a single clear answer to this.

Everywhere it says Avro is good because it stores the schema. My question is: what do I do with that schema?

It may be good for very large objects in Hadoop, where a single object is stored across multiple file blocks, so that storing the schema with each part helps analyze it better. But even in that case, the schema can be stored separately, and just a reference to it is sufficient to describe the data. I see no reason why the schema should be part of each and every piece.

If someone can give me a good use case where Avro helped them and Gson/Jackson were insufficient for the purpose, it would be really helpful.

Also, the official documentation at the Avro site says that we need to give a schema to Avro to help it produce Schema+Data. My question is: if the schema is an input, and the same schema is sent to the output along with the JSON representation of the data, then what extra is being achieved by Avro? Can I not do that myself by serializing an object using JSON, adding my input schema, and calling it Avro?

I am really confused with this!

user2250246

3 Answers

12
  1. Evolving schemas

Suppose initially you designed a schema like this for your Employee class:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

Later you realized that age is redundant and removed it from the schema.

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"}
  ]
}

What about the records that were serialized and stored before this schema change? How will you read back those records?

That's why the Avro reader/deserializer asks for both the reader and the writer schema. Internally it does schema resolution, i.e. it tries to adapt the old schema to the new schema.

Go to this link - http://avro.apache.org/docs/1.7.2/api/java/org/apache/avro/io/parsing/doc-files/parsing.html - section "Resolution using action symbols"

In this case it performs a "skip" action, i.e. it leaves out reading "age". It can also handle cases like a field changing from int to long, etc. (a short sketch of this appears after the list below).

This is a very nice article explaining schema evolution - http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

  2. Schema is stored only once for multiple records in a single file.

  3. Size: records are encoded in very few bytes.
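Here is a minimal sketch in Java of points 1 and 2 above, using Avro's GenericRecord and data-file APIs (the Employee record name, the sample values and the in-memory streams are just illustrative assumptions): the container file stores the writer schema once in its header, and reading it back with both the old (writer) and the new (reader) schema makes Avro skip the dropped "age" field.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SchemaEvolutionSketch {
    public static void main(String[] args) throws Exception {
        // Old (writer) schema: the one the records were originally stored with.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
          + "{\"name\":\"emp_name\",\"type\":\"string\"},"
          + "{\"name\":\"dob\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");
        // New (reader) schema: "age" has been dropped.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
          + "{\"name\":\"emp_name\",\"type\":\"string\"},"
          + "{\"name\":\"dob\",\"type\":\"string\"}]}");

        // Write two records to an Avro container (in memory here). The writer schema
        // is stored once in the file header, not once per record.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataFileWriter<GenericRecord> fileWriter =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(writerSchema));
        fileWriter.create(writerSchema, out);
        for (String[] e : new String[][] {{"John", "19901020"}, {"Jane", "19870310"}}) {
            GenericRecord rec = new GenericData.Record(writerSchema);
            rec.put("emp_name", e[0]);
            rec.put("dob", e[1]);
            rec.put("age", 32);
            fileWriter.append(rec);
        }
        fileWriter.close();

        // Read back with BOTH schemas: schema resolution performs the "skip" action on "age".
        DataFileStream<GenericRecord> fileReader = new DataFileStream<>(
            new ByteArrayInputStream(out.toByteArray()),
            new GenericDatumReader<GenericRecord>(writerSchema, readerSchema));
        for (GenericRecord rec : fileReader) {
            System.out.println(rec); // e.g. {"emp_name": "John", "dob": "19901020"}
        }
        fileReader.close();
    }
}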

Stephen Kennedy
Vishal John
  • I don't understand what is helpful about this. If the schema changes, aren't the object semantics likely to change as well? How could an automated system reliably determine how to interpret things like semantically conflicting fields? – Erik Garrison Jun 07 '15 at 08:54
  • Also it should be noted that skipping outdated fields is standard practice in other IDLs (at least protobuf, which I am familiar with). – Erik Garrison Jun 07 '15 at 08:56
  • "Schema is stored only once for multiple records in a single file" is great information, but I could not find a reference for it; please share. – Sankalp Nov 25 '16 at 05:20
9

I think one of the key problems solved by schema evolution is not mentioned explicitly anywhere, and that is why it causes so much confusion for newcomers.

An example will clarify this:

Let us say a bank stores an audit log of all its transactions. The logs have a particular format and need to be stored for at least 10 years. It is also very desirable that the system holding these logs adapt to the format evolving over those 10 years.

The schema for such entries would not change too often, let us say twice a year on average, but each schema would cover a large number of entries. If we do not keep track of the schemas, then after a while we would need to consult very old code to figure out which fields were present at the time and keep adding if-else statements to process the different formats. With a schema store holding all these formats, we can use the schema-evolution feature to automatically convert one format into another (Avro does this automatically if you provide it with the older and newer schemas). This saves applications from carrying a lot of if-else statements in their code, and it also makes the data more manageable, as we readily know what formats exist just by looking at the set of stored schemas (schemas are generally kept in a separate store, and the data carries only an ID pointing to its schema).
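As a rough sketch of that pattern in Java (the 4-byte schema-ID prefix and the in-memory SCHEMA_STORE map are illustrative assumptions, not Avro APIs; real systems usually delegate this to a schema registry), the reader looks up each payload's writer schema by its ID and lets Avro's schema resolution convert the record to the latest format:

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class AuditLogReaderSketch {
    // Hypothetical store of every schema version the audit log has ever used,
    // keyed by a small numeric ID (population of the map is omitted here).
    static final Map<Integer, Schema> SCHEMA_STORE = new HashMap<>();

    static GenericRecord decode(byte[] payload, Schema latestSchema) throws Exception {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int schemaId = buf.getInt();                      // assumed 4-byte schema-ID prefix
        Schema writerSchema = SCHEMA_STORE.get(schemaId); // schema the producer wrote with
        byte[] body = new byte[buf.remaining()];
        buf.get(body);
        // Avro resolves writer schema vs. latest schema, so old entries come out
        // in the current format without any if-else per historical version.
        return new GenericDatumReader<GenericRecord>(writerSchema, latestSchema)
                .read(null, DecoderFactory.get().binaryDecoder(body, null));
    }
}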

Another advantage of schema evolution is that producers of a new format can safely produce objects with the new schema without waiting for the downstream consumers to change first. The downstream consumers can have logic built in to simply suspend processing until they have visibility of the new schema associated with the new format. This automatic suspension is great for keeping the system online while the processing logic adapts to the new schema.

So in summary, schema evolution helps newer clients read older formats by making use of automatic format conversion, and it also helps older clients suspend processing gracefully until they have been enabled to understand newer formats.

user2250246
  • But imagine you have a log monitoring system and you change the schema of the data format produced by your applications/services/components; your monitoring system will not be able to handle the new records and will become de facto unusable. The same applies to your banking-transaction use case, from my perspective. Fine, I have the new format coming in, but no one can process it :-)) It would be useful if Avro let the new format be consumed by consumers that are still on the old schema version and preparing for migration. Then there would be no outage, but what you are describing does not help with that. – kensai Apr 21 '17 at 10:36
  • I agree on one point: producers can produce the new model and decouple from validation by consumers, which in a SOA/microservice architecture would otherwise just reject it, stopping the consumers. So I can change the consumer and producer independently. Avro does not solve everything, but it fundamentally applies one of the old, core SOA/microservice principles: decoupling of functionality. – kensai Apr 21 '17 at 10:40
1

Having a schema is good from a schema-evolution perspective, but it isn't something that cannot be done in JSON. It is exactly the same with JSON as with Avro.

The main thing with Avro is that it's binary. So let us say your record's schema is:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

Avro will store the data portion for the above as: "John"<M>19901020<M>32

Here <M> is some kind of marker to demarcate between the field-values.

So without the schema, the above data is useless! You don't know what those values mean without the schema.

That's the difference between binary and non-binary formats. Binary formats store just the values, which makes them super compact. Text formats like JSON and XML repeat the schema (the field names) over and over again, often making them more than 2x bulkier than binary.

Another comparison: an integer in binary takes only 32 bits, but an integer in JSON is a number stored as text.

{"some_number": 43847384}

The above number is not stored as a 32-bit integer in JSON; it is all text and takes up 8 chars * 8 bits = 64 bits, or even more depending on the length of the number.
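Here is a small sketch of that comparison in Java (the Num record name and the single-field schema are just assumptions mirroring the example above); it prints the byte count of the same number as JSON text and as an Avro binary record:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class SizeComparisonSketch {
    public static void main(String[] args) throws Exception {
        String json = "{\"some_number\": 43847384}";
        System.out.println("JSON bytes: " + json.getBytes("UTF-8").length); // 25 bytes

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Num\",\"fields\":"
          + "[{\"name\":\"some_number\",\"type\":\"int\"}]}");
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("some_number", 43847384);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(rec, enc);
        enc.flush();
        // Avro uses variable-length zig-zag encoding for ints, so this value needs
        // only 4 bytes, and the field name appears nowhere in the payload.
        System.out.println("Avro bytes: " + out.size()); // 4 bytes
    }
}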

So the schema is not optional the way it is with plain JSON. It is a MUST if you want to work with binary formats.

user2250246