Questions tagged [avro]

Apache Avro is a data serialization framework primarily used in Apache Hadoop.

Apache Avro is a data serialization system.

Features:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.
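
As a concrete illustration of the self-describing container format, here is a minimal sketch using the Avro Java API (the org.apache.avro:avro dependency is assumed; the `User` schema and file name are invented for the example). The schema is written once into the file header, and the reader recovers it without being told anything:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ContainerFileDemo {
    public static void main(String[] args) throws Exception {
        // An invented example schema; any record schema works the same way.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        File file = new File("users.avro");

        // Write: the schema is embedded in the file header once,
        // so each record carries no per-value type information.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read: no schema needs to be supplied; it is recovered from the file.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            System.out.println(reader.getSchema().getName());
            for (GenericRecord r : reader) {
                System.out.println(r.get("name") + " " + r.get("age"));
            }
        }
    }
}
```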

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
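
For instance, a minimal record schema (an invented example) looks like this:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The `["null", "string"]` union is the idiomatic way to express an optional field.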

Comparison with other systems:

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
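
The field-name-based resolution in the last point can be sketched with the Avro Java API (the org.apache.avro:avro dependency is assumed; the schemas are invented for the example): a record serialized with an old writer schema is read with a newer reader schema that adds a defaulted field, and Avro reconciles the two by name:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionDemo {
    public static void main(String[] args) throws Exception {
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");
        // The reader's schema adds an "age" field with a default value.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

        // Serialize with the writer's (old) schema.
        GenericRecord user = new GenericData.Record(writerSchema);
        user.put("name", "alice");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(user, encoder);
        encoder.flush();

        // Deserialize with both schemas: fields are matched by name,
        // and the missing "age" field takes its default.
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(writerSchema, readerSchema);
        GenericRecord result = reader.read(null,
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        System.out.println(result.get("name") + " " + result.get("age"));
    }
}
```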


Official Website: http://avro.apache.org/


3646 questions
32 votes, 6 answers

Polymorphism and inheritance in Avro schemas

Is it possible to write an Avro schema/IDL that will generate a Java class that either extends a base class or implements an interface? It seems like the generated Java class extends the org.apache.avro.specific.SpecificRecordBase. So, the…
asked by bsam
31 votes, 11 answers

Integrating Spark Structured Streaming with the Confluent Schema Registry

I'm using a Kafka Source in Spark Structured Streaming to receive Confluent-encoded Avro records. I intend to use the Confluent Schema Registry, but integration with Spark Structured Streaming seems to be impossible. I have seen this question, but…
31 votes, 3 answers

What is the advantage of storing schema in avro?

We need to serialize some data for putting into Solr as well as Hadoop. I am evaluating serialization tools for the same. The top two on my list are Gson and Avro. As far as I understand, Avro = Gson + Schema-In-JSON. If that is correct, I do not see…
asked by user2250246
30 votes, 5 answers

How to encode/decode Kafka messages using Avro binary encoder?

I'm trying to use Avro for messages being read from/written to Kafka. Does anyone have an example of using the Avro binary encoder to encode/decode data that will be put on a message queue? I need the Avro part more than the Kafka part. Or, perhaps…
asked by blockcipher
29 votes, 5 answers

KafkaAvroSerializer for serializing Avro without schema.registry.url

I'm a noob to Kafka and Avro, so I have been trying to get the Producer/Consumer running. So far I have been able to produce and consume simple Bytes and Strings, using the following: Configuration for the Producer: Properties props = new…
asked by scissorHands
27 votes, 3 answers

How to extract schema from an avro file in Java

How do you extract first the schema and then the data from an avro file in Java? Identical to this question except in java. I've seen examples of how to get the schema from an avsc file but not an avro file. What direction should I be looking…
asked by mba12
27 votes, 1 answer

Using apache avro reflect

Avro serialization is popular with Hadoop users but examples are so hard to find. Can anyone help me with this sample code? I'm mostly interested in using the Reflect API to read/write into files and to use the Union and Null annotations. public…
asked by fodon
25 votes, 2 answers

Google Dataflow job cost optimization

I have run the below code on 522 gzip files of size 100 GB; after decompressing, it is around 320 GB of data in protobuf format, and the output is written to GCS. I have used n1-standard machines and the region for input, output all taken care…
25 votes, 1 answer

Apache Kafka with Avro and Schema Repo - where in the message does the schema Id go?

I want to use Avro to serialize the data for my Kafka messages and would like to use it with an Avro schema repository so I don't have to include the schema with every message. Using Avro with Kafka seems like a popular thing to do, and lots of…
asked by jheppinstall
24 votes, 2 answers

Does binary encoding of AVRO compress data?

In one of our projects we are using Kafka with AVRO to transfer data across applications. Data is added to an AVRO object, and the object is binary-encoded to write to Kafka. We use binary encoding as it is generally mentioned as a minimal representation…
asked by Pal
23 votes, 6 answers

Avro with Java 8 dates as logical type

The latest Avro compiler (1.8.2) generates Java sources for date logical types with Joda-Time-based implementations. How can I configure the Avro compiler to produce sources that use the Java 8 date-time API?
asked by injecto
23 votes, 3 answers

generating an AVRO schema from a JSON document

Is there any tool able to create an AVRO schema from a 'typical' JSON document? For example: { "records":[{"name":"X1","age":2},{"name":"X2","age":4}] } I found http://jsonschema.net/reboot/#/ which generates a 'json-schema' { "$schema":…
asked by Pierre
22 votes, 6 answers

Json String to Java Object Avro

I am trying to convert a Json string into a generic Java Object, with an Avro Schema. Below is my code. String json = "{\"foo\": 30.1, \"bar\": 60.2}"; String schemaLines =…
asked by Princey James
21 votes, 3 answers

In Java, how can I create an equivalent of an Apache Avro container file without being forced to use a File as a medium?

This is somewhat of a shot in the dark in case anyone savvy with the Java implementation of Apache Avro is reading this. My high-level objective is to have some way to transmit some series of avro data over the network (let's just say HTTP for…
asked by omnilinguist
21 votes, 1 answer

What's the reason behind ZigZag encoding in Protocol Buffers and Avro?

ZigZag requires a lot of overhead to write/read numbers. Actually I was stunned to see that it doesn't just write int/long values as they are, but does a lot of additional scrambling. There's even a loop…
asked by Endrju
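
For readers curious about the ZigZag question above: ZigZag maps signed integers to unsigned ones so that values of small magnitude, positive or negative, become small numbers and therefore occupy few bytes in the subsequent varint encoding. A minimal sketch in plain Java (the class name is invented; this mirrors the transform Avro and Protocol Buffers specify, not any library's internal code):

```java
// ZigZag encoding: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
// Small-magnitude values map to small unsigned values, which the
// varint layer then stores in few bytes.
public class ZigZag {
    static long encode(long n) {
        // Arithmetic shift copies the sign bit into all positions,
        // so the XOR flips the bits of negative numbers.
        return (n << 1) ^ (n >> 63);
    }

    static long decode(long z) {
        // Logical shift recovers the magnitude; -(z & 1) undoes the flip.
        return (z >>> 1) ^ -(z & 1);
    }

    public static void main(String[] args) {
        System.out.println(encode(0));   // 0
        System.out.println(encode(-1));  // 1
        System.out.println(encode(1));   // 2
        System.out.println(encode(-2));  // 3
    }
}
```

Without this step, a plain two's-complement negative number has its high bits set and would always cost the maximum number of varint bytes.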