Questions tagged [avro]

Apache Avro is a data serialization system, used primarily in the Apache Hadoop ecosystem.

Features:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages (a minimal sketch of the no-codegen workflow follows this list).
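
To make the no-codegen point concrete, here is a minimal sketch in Python using the official avro package (the file name is an assumption); records come back as plain dicts, decoded with the schema embedded in the file:

    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    # No generated classes: each record is returned as a plain Python dict,
    # decoded using the writer's schema stored in the file itself.
    with open("events.avro", "rb") as f:
        reader = DataFileReader(f, DatumReader())
        for record in reader:
            print(record)
        reader.close()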

Schemas:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
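
As a sketch of that resolution in Python (the record and the added field are illustrative): data written with an old schema is read back with a newer one that supplies a default for the new field.

    import io
    import avro.schema
    from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

    # Note: some older avro releases spell this avro.schema.Parse.
    old_schema = avro.schema.parse(
        '{"type": "record", "name": "User", "fields":'
        ' [{"name": "name", "type": "string"}]}')
    new_schema = avro.schema.parse(
        '{"type": "record", "name": "User", "fields":'
        ' [{"name": "name", "type": "string"},'
        '  {"name": "age", "type": "int", "default": -1}]}')

    # Write a datum with the old (writer's) schema...
    buf = io.BytesIO()
    DatumWriter(old_schema).write({"name": "ada"}, BinaryEncoder(buf))

    # ...and read it back with the new (reader's) schema; the missing
    # "age" field is filled in from its default during resolution.
    buf.seek(0)
    print(DatumReader(old_schema, new_schema).read(BinaryDecoder(buf)))
    # -> {'name': 'ada', 'age': -1}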

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
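
For example, a minimal record schema (names illustrative):

    {
      "type": "record",
      "name": "User",
      "namespace": "example.avro",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]}
      ]
    }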

Comparison with other systems:

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.

Official Website: http://avro.apache.org/

3646 questions
21 votes · 6 answers

Generic conversion from POJO to Avro Record

I'm looking for a way to convert a POJO to an avro object in a generic way. The implementation should be robust to any changes of the POJO class. I have achieved it by filling the avro record explicitly (see example below). Is there a way to get…
Fabian Braun
20 votes · 1 answer

How to pass parameters for a specific Schema registry when using Kafka Avro Console Consumer?

I am trying to use the Confluent kafka-avro-console-consumer, but how do I pass parameters for the Schema Registry to it?
Joe
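
One commonly cited approach for this question (the broker, topic, and registry addresses below are placeholders) is to pass the registry location via --property:

    kafka-avro-console-consumer \
      --bootstrap-server localhost:9092 \
      --topic my-topic \
      --from-beginning \
      --property schema.registry.url=http://localhost:8081
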
20 votes · 3 answers

How to extract schema for avro file in python

I am trying to use the Python Avro library (https://pypi.python.org/pypi/avro) to read an Avro file generated by Java. Since the schema is already embedded in the avro file, why do I need to specify a schema file? Is there a way to extract it…
ljxue
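
A sketch of one way to pull the embedded writer's schema out of a container file with the avro package (the file name is an assumption, and attribute names vary slightly across library versions):

    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    with open("data.avro", "rb") as f:
        reader = DataFileReader(f, DatumReader())
        # The writer's schema travels in the container file header,
        # so no external .avsc file is needed; str() gives its JSON form.
        print(reader.datum_reader.writers_schema)
        reader.close()
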
19 votes · 3 answers

"The $changeStream stage is only supported on replica sets" error while using mongodb-source-connect

I get an error when running kafka-mongodb-source-connect. I was trying to run connect-standalone with connect-avro-standalone.properties and MongoSourceConnector.properties so that Connect writes data from MongoDB to a Kafka topic. This…
Jaeho Lee
19 votes · 2 answers

Avro multiple record of same type in single schema

I'd like to use the same record type in an Avro schema multiple times. Consider this schema definition: { "type": "record", "name": "OrderBook", "namespace": "my.types", "doc": "Test order update", "fields": [ { …
Daniel
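
A sketch of the usual answer, with outer names following the excerpt above (the inner fields are illustrative): define the named record once, then refer to it by name for later fields:

    {
      "type": "record",
      "name": "OrderBook",
      "namespace": "my.types",
      "doc": "Test order update",
      "fields": [
        {"name": "bids", "type": {"type": "array", "items": {
          "type": "record",
          "name": "OrderBookVolume",
          "fields": [
            {"name": "price", "type": "double"},
            {"name": "volume", "type": "double"}
          ]
        }}},
        {"name": "asks", "type": {"type": "array", "items": "OrderBookVolume"}}
      ]
    }
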
19 votes · 1 answer

How to mix record with map in Avro?

I'm dealing with server logs which are in JSON format, and I want to store my logs on AWS S3 in Parquet format (and Parquet requires an Avro schema). First, all logs have a common set of fields; second, all logs have a lot of optional fields which are…
soulmachine
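
One common shape for this, sketched with illustrative field names: keep the shared fields as ordinary record fields and put the open-ended optional ones in a map:

    {
      "type": "record",
      "name": "LogEvent",
      "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "level", "type": "string"},
        {"name": "extras", "type": {"type": "map", "values": "string"}}
      ]
    }
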
18 votes · 1 answer

optional array in avro schema

I'm wondering whether or not it is possible to have an optional array. Let's assume a schema like this: { "type": "record", "name": "test_avro", "fields" : [ {"name": "test_field_1", "type": "long"}, {"name":…
Philipp Pahl
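
The usual pattern for an optional field is a union with null plus a null default; a sketch continuing the excerpt's names (the array's item type is an assumption):

    {
      "type": "record",
      "name": "test_avro",
      "fields": [
        {"name": "test_field_1", "type": "long"},
        {"name": "opt_array",
         "type": ["null", {"type": "array", "items": "string"}],
         "default": null}
      ]
    }
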
18 votes · 4 answers

How to decode/deserialize Avro with Python from Kafka

I am receiving Kafka Avro messages from a remote server in Python (using the consumer from the Confluent Kafka Python library); they represent clickstream data as JSON dictionaries with fields like user agent, location, URL, etc. Here is what a message…
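
If the messages were produced with Confluent's Avro serializer, a sketch of manual decoding (obtaining the writer's schema is left as an assumption; Confluent's wire format prefixes each message with a magic byte and a 4-byte schema ID):

    import io
    import struct
    import avro.io

    def decode(raw_bytes, writer_schema):
        # Confluent wire format: 1 magic byte (0) + 4-byte big-endian
        # schema ID, followed by the ordinary Avro-encoded body.
        magic, schema_id = struct.unpack(">bI", raw_bytes[:5])
        decoder = avro.io.BinaryDecoder(io.BytesIO(raw_bytes[5:]))
        return avro.io.DatumReader(writer_schema).read(decoder)
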
18 votes · 6 answers

Deserialize an Avro file with C#

I can't find a way to deserialize an Apache Avro file with C#. The Avro file was generated by the Archive feature in Microsoft Azure Event Hubs. With Java I can use Avro Tools from Apache to convert the file to JSON: java -jar…
Kristoffer Jälén
18 votes · 3 answers

python Spark avro

When attempting to write avro, I get the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 35.0 failed 1 times, most recent failure: Lost task 7.0 in stage 35.0 (TID 110, localhost):…
Rolando
18 votes · 3 answers

How to serialize a Date using AVRO in Java

I'm trying to serialize objects containing dates with Avro, and the deserialized date doesn't match the expected value (tested with Avro 1.7.2 and 1.7.1). Here's the class I'm serializing: import java.text.SimpleDateFormat; import…
Miguel L.
17 votes · 2 answers

Nesting Avro schemas

According to this question on nesting Avro schemas, the right way to nest a record schema is as follows: { "name": "person", "type": "record", "fields": [ {"name": "firstname", "type": "string"}, {"name": "lastname",…
Tianxiang Xiong
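
A complete version of that nesting pattern, with the inner record defined inline as the field's type (the address fields are illustrative):

    {
      "name": "person",
      "type": "record",
      "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {"name": "address", "type": {
          "type": "record",
          "name": "AddressRecord",
          "fields": [
            {"name": "streetaddress", "type": "string"},
            {"name": "city", "type": "string"}
          ]
        }}
      ]
    }
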
17 votes · 1 answer

Why we need Avro schema evolution

I am new to Hadoop and programming, and I am a little confused about Avro schema evolution. I will explain what I understand about Avro so far. Avro is a serialization tool that stores binary data with its JSON schema at the top. The schema looks…
Anaadih.pradeep
17 votes · 1 answer

how to mark avro field deprecated in JSON/avsc?

I was looking for a method to mark an Avro field deprecated in such a way that the generated Java code (getters and setters for the field) is marked with a @Deprecated annotation. Putting @Deprecated into the "doc" field doesn't work, because the generator puts it into…
tworec
17 votes · 2 answers

Encode an object with Avro to a byte array in Python

In Python 2.7, using Avro, I'd like to encode an object to a byte array. All examples I've found write to a file. I've tried using io.BytesIO() but this gives: AttributeError: '_io.BytesIO' object has no attribute 'write_long' Sample using…
Grant Overby
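
The error suggests the raw buffer was handed to the writer where an encoder belongs; a minimal sketch (the schema and datum are placeholders) wraps the BytesIO in avro.io.BinaryEncoder:

    import io
    import avro.io
    import avro.schema

    # Placeholder schema; the real one comes from the application.
    schema = avro.schema.parse(
        '{"type": "record", "name": "Msg",'
        ' "fields": [{"name": "body", "type": "string"}]}')

    buf = io.BytesIO()
    # Wrap the buffer: BinaryEncoder, not BytesIO, provides write_long etc.
    encoder = avro.io.BinaryEncoder(buf)
    avro.io.DatumWriter(schema).write({"body": "hello"}, encoder)
    raw_bytes = buf.getvalue()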