Questions tagged [avro]

Apache Avro is a data serialization system, primarily used in Apache Hadoop.

Features:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
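
A minimal sketch of that round trip with the Python avro package (file names and the User record are illustrative; an example schema definition appears below). Note that the older avro-python3 distribution spells the parse function Parse:

    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    # Parse the writer's schema from a .avsc file.
    schema = avro.schema.parse(open("user.avsc", "rb").read())

    # Write: the schema is stored in the container file alongside the data.
    writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
    writer.append({"name": "Alyssa", "favorite_number": 256})
    writer.close()

    # Read: no schema is supplied; it is taken from the file itself.
    reader = DataFileReader(open("users.avro", "rb"), DatumReader())
    for user in reader:
        print(user)
    reader.close()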

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
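
For example, a minimal record schema (the illustrative User record assumed in the sketch above) looks like this:

    {
      "namespace": "example.avro",
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]}
      ]
    }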

Comparison with other systems:

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names (see the sketch after this list).
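
A sketch of that symbolic resolution with the Python avro package (schemas and field names are illustrative): a datum written under an old schema is read under a newer one that adds a defaulted field, matched purely by field name.

    import io
    import avro.schema
    from avro.io import DatumWriter, DatumReader, BinaryEncoder, BinaryDecoder

    old = avro.schema.parse(
        '{"type": "record", "name": "User", "fields": ['
        '{"name": "name", "type": "string"}]}')
    new = avro.schema.parse(
        '{"type": "record", "name": "User", "fields": ['
        '{"name": "name", "type": "string"},'
        '{"name": "age", "type": "int", "default": -1}]}')

    # Encode with the old (writer's) schema.
    buf = io.BytesIO()
    DatumWriter(old).write({"name": "Alyssa"}, BinaryEncoder(buf))

    # Decode with both schemas present: the missing "age" field is filled
    # from its default, resolved by field name rather than by field ID.
    buf.seek(0)
    print(DatumReader(old, new).read(BinaryDecoder(buf)))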

Available Languages: C, C++, C#, Java, Perl, PHP, Python, Ruby

Official Website: http://avro.apache.org/

3646 questions
17 votes, 3 answers

Spark: Writing to Avro file

I am in Spark and I have an RDD from an Avro file. I now want to do some transformations on that RDD and save it back as an Avro file: val job = new Job(new Configuration()) AvroJob.setOutputKeySchema(job, getOutputSchema(inputSchema)) rdd.map(elem =>…
user1013725
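
On recent Spark versions, a higher-level alternative to the AvroJob/RDD route in this question is the DataFrame API with the spark-avro module (module version and output path below are illustrative):

    from pyspark.sql import SparkSession

    # Requires the spark-avro module, e.g.
    #   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 ...
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("Alyssa", 256)], ["name", "favorite_number"])
    df.write.format("avro").save("/tmp/users_avro")
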
17 votes, 4 answers

Can you put comments in Avro JSON schema files?

I'm writing my first Avro schema, which uses JSON as the schema language. I know you cannot put comments into plain JSON, but I'm wondering if the Avro tool allows comments. E.g. Perhaps it strips them (like a preprocessor) before parsing the…
jfritz42
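
Plain JSON has no comment syntax, but Avro schemas offer a first-class alternative: records, fields, and enums accept a "doc" attribute, and Avro implementations ignore unrecognized extra attributes. For example:

    {
      "type": "record",
      "name": "User",
      "doc": "Schema-level documentation goes here.",
      "fields": [
        {"name": "name", "type": "string", "doc": "Field-level documentation."}
      ]
    }
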
17 votes, 2 answers

Can I get a Scala case class definition from an Avro schema definition?

To facilitate working with Avro in Scala, I'd like to define a case class based on the schema stored with a .avro file. I could try: Writing a .scala case class definition by hand. Programmatically writing strings to a .scala file. Spoof the…
Julian Peeters
16 votes, 1 answer

Kafka Streams - SerializationException: Unknown magic byte

I am trying to create a Kafka Streams Application which processes Avro records, but I am getting the following error: Exception in thread "streams-application-c8031218-8de9-4d55-a5d0-81c30051a829-StreamThread-1"…
16 votes, 2 answers

Why is Spark performing worse when using Kryo serialization?

I enabled Kryo serialization for my Spark job, enabled the setting to require registration, and ensured all my types were registered. val conf = new SparkConf() conf.set("spark.serializer",…
Leif Wickland
16 votes, 4 answers

How to read Avro file in PySpark

I am writing a Spark job using Python. However, I need to read in a whole bunch of Avro files. This is the closest solution that I have found in Spark's example folder. However, you need to submit this Python script using spark-submit. In the…
B.Mr.W.
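
A minimal DataFrame-based sketch, assuming the spark-avro module is on the classpath (the input path is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The reader takes the schema from the Avro files themselves.
    df = spark.read.format("avro").load("/data/events.avro")
    df.show()
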
16 votes, 3 answers

Avro schema definition nesting types

I am fairly new to Avro and going through documentation for nested types. I have the example below working nicely but many different types within the model will have addresses. Is it possible to define an address.avsc file and reference that as a…
derdc
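
One common approach, sketched with illustrative names: once a named record such as example.Address is defined, later fields can refer to it by name instead of repeating the definition. In Java, a single Schema.Parser instance can likewise parse address.avsc first and then a schema that references it by full name.

    {
      "type": "record",
      "name": "Customer",
      "namespace": "example",
      "fields": [
        {"name": "home", "type": {
          "type": "record",
          "name": "Address",
          "fields": [{"name": "street", "type": "string"}]
        }},
        {"name": "work", "type": "example.Address"}
      ]
    }
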
16 votes, 2 answers

How to define avro schema for complex json document?

I have a JSON document that I would like to convert to Avro and need a schema to be specified for that purpose. Here is the JSON document for which I would like to define the avro schema: { "uid": 29153333, "somefield": "somevalue", "options": [ …
user2727704
15 votes, 2 answers

Use kafka-avro-console-producer with a schema already in the schema registry

I would like to use the kafka-avro-console-producer with the schema registry. I have big schemas (over 10k chars) and I can't really paste them as a command line argument. Besides that, I'd like to use the schema registry directly so I can use a…
0x26res
15 votes, 2 answers

Python AVRO reader returns AssertionError when decoding kafka messages

Newbie playing with Kafka and Avro. I am trying to deserialise Avro messages in Python 3.7.3 using the kafka-python and avro-python3 packages, following this answer. The function responsible for decoding the Kafka messages is def…
Mattia Paterna
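
A frequent cause of such failures (assumed here, since the question is truncated) is that messages produced with Confluent serializers are not plain Avro: they carry a five-byte prefix of one magic byte plus a four-byte schema ID ahead of the Avro body. A sketch of unwrapping that framing by hand:

    import io
    import struct

    import avro.schema
    from avro.io import DatumReader, BinaryDecoder

    def decode_confluent(raw_bytes, schema):
        # Confluent wire format: magic byte (0), 4-byte big-endian schema ID,
        # then the Avro-encoded body.
        magic, schema_id = struct.unpack(">bI", raw_bytes[:5])
        assert magic == 0, "not Confluent-framed Avro"
        return DatumReader(schema).read(BinaryDecoder(io.BytesIO(raw_bytes[5:])))
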
15 votes, 4 answers

Getting Started with Avro

I want to get started with using Avro with MapReduce. Can someone suggest a good tutorial or example to get started with? I couldn't find much through internet search.
Sri
15 votes, 1 answer

Unnesting in SQL (Athena): How to convert array of structs into an array of values plucked from the structs?

I am taking samples from a Bayesian statistical model, serializing them with Avro, uploading them to S3, and querying them with Athena. I need help writing a query that unnests an array in the table. The CREATE TABLE query looks like: CREATE…
Count Zero
15 votes, 2 answers

Storing null values in avro files

I have some json data that looks like this: { "id": 1998983092, "name": "Test Name 1", "type": "search string", "creationDate": "2017-06-06T13:49:15.091+0000", "lastModificationDate": "2017-06-28T14:53:19.698+0000", …
mba12
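
In Avro, a field that may be null is declared as a union with the null type, with null listed first so it can serve as the default; a sketch (field name taken from the question's JSON):

    {"name": "lastModificationDate", "type": ["null", "string"], "default": null}
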
15 votes, 1 answer

module 'avro.schema' has no attribute 'parse'

I am new to Python and I was trying to write a simple piece of code for converting a text file to Avro. I am getting this error that the module was not found. I can clearly see in the schema.py file that the parse module exists. I will appreciate it if someone could…
Kevin K
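
The usual cause: the avro-python3 distribution capitalizes the function as Parse, while the avro package spells it parse. A small sketch that tolerates both (the .avsc file name is illustrative):

    import avro.schema

    try:
        parse = avro.schema.parse   # the avro package
    except AttributeError:
        parse = avro.schema.Parse   # the avro-python3 package

    schema = parse(open("user.avsc").read())
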
15 votes, 2 answers

KafkaAvroDeserializer does not return SpecificRecord but returns GenericRecord

My KafkaProducer is able to use KafkaAvroSerializer to serialize objects to my topic. However, KafkaConsumer.poll() returns deserialized GenericRecord instead of my serialized class. MyKafkaProducer KafkaProducer producer; …
Glide
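
The consumer-side switch for this is the specific.avro.reader property: when it is left unset, KafkaAvroDeserializer falls back to GenericRecord. Shown as a bare consumer property (surrounding configuration omitted):

    specific.avro.reader=true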