Questions tagged [avro]

Apache Avro is a data serialization system, primarily used in Apache Hadoop.

Features:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
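
A minimal sketch of that round trip with the Python avro package (file names and the User record are illustrative; an example schema definition appears below). Note that the older avro-python3 distribution spells the parse function Parse:

    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    # Parse the writer's schema from a .avsc file.
    schema = avro.schema.parse(open("user.avsc", "rb").read())

    # Write: the schema is stored in the container file alongside the data.
    writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
    writer.append({"name": "Alyssa", "favorite_number": 256})
    writer.close()

    # Read: no schema is supplied; it is taken from the file itself.
    reader = DataFileReader(open("users.avro", "rb"), DatumReader())
    for user in reader:
        print(user)
    reader.close()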

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
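
For example, a minimal record schema (the illustrative User record assumed in the sketch above) looks like this:

    {
      "namespace": "example.avro",
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]}
      ]
    }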

Comparison with other systems:

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names (see the sketch after this list).
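
A sketch of that symbolic resolution with the Python avro package (schemas and field names are illustrative): a datum written under an old schema is read under a newer one that adds a defaulted field, matched purely by field name.

    import io
    import avro.schema
    from avro.io import DatumWriter, DatumReader, BinaryEncoder, BinaryDecoder

    old = avro.schema.parse(
        '{"type": "record", "name": "User", "fields": ['
        '{"name": "name", "type": "string"}]}')
    new = avro.schema.parse(
        '{"type": "record", "name": "User", "fields": ['
        '{"name": "name", "type": "string"},'
        '{"name": "age", "type": "int", "default": -1}]}')

    # Encode with the old (writer's) schema.
    buf = io.BytesIO()
    DatumWriter(old).write({"name": "Alyssa"}, BinaryEncoder(buf))

    # Decode with both schemas present: the missing "age" field is filled
    # from its default, resolved by field name rather than by field ID.
    buf.seek(0)
    print(DatumReader(old, new).read(BinaryDecoder(buf)))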

Available Languages: C, C++, C#, Java, Perl, PHP, Python, Ruby

Official Website: http://avro.apache.org/

3646 questions
17 votes, 3 answers

Spark: Writing to Avro file

I am in Spark and I have an RDD from an Avro file. I now want to do some transformations on that RDD and save it back as an Avro file: val job = new Job(new Configuration()) AvroJob.setOutputKeySchema(job, getOutputSchema(inputSchema)) rdd.map(elem =>…
user1013725
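
On recent Spark versions, a higher-level alternative to the AvroJob/RDD route in this question is the DataFrame API with the spark-avro module (module version and output path below are illustrative):

    from pyspark.sql import SparkSession

    # Requires the spark-avro module, e.g.
    #   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 ...
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("Alyssa", 256)], ["name", "favorite_number"])
    df.write.format("avro").save("/tmp/users_avro")
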
17 votes, 4 answers

Can you put comments in Avro JSON schema files?

I'm writing my first Avro schema, which uses JSON as the schema language. I know you cannot put comments into plain JSON, but I'm wondering if the Avro tool allows comments. E.g. Perhaps it strips them (like a preprocessor) before parsing the…
jfritz42
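
Plain JSON has no comment syntax, but Avro schemas offer a first-class alternative: records, fields, and enums accept a "doc" attribute, and Avro implementations ignore unrecognized extra attributes. For example:

    {
      "type": "record",
      "name": "User",
      "doc": "Schema-level documentation goes here.",
      "fields": [
        {"name": "name", "type": "string", "doc": "Field-level documentation."}
      ]
    }
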
17 votes, 2 answers

Can I get a Scala case class definition from an Avro schema definition?

To facilitate working with Avro in Scala, I'd like to define a case class based on the schema stored with a .avro file. I could try: Writing a .scala case class definition by hand. Programmatically writing strings to a .scala file. Spoof the…
Julian Peeters
16 votes, 1 answer

Kafka Streams - SerializationException: Unknown magic byte

I am trying to create a Kafka Streams Application which processes Avro records, but I am getting the following error: Exception in thread "streams-application-c8031218-8de9-4d55-a5d0-81c30051a829-StreamThread-1"…
16 votes, 2 answers

Why is Spark performing worse when using Kryo serialization?

I enabled Kryo serialization for my Spark job, enabled the setting to require registration, and ensured all my types were registered. val conf = new SparkConf() conf.set("spark.serializer",…
Leif Wickland
16 votes, 4 answers

How to read Avro file in PySpark

I am writing a Spark job using Python. However, I need to read in a whole bunch of Avro files. This is the closest solution that I have found in Spark's example folder. However, you need to submit this Python script using spark-submit. In the…
B.Mr.W.
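
A minimal DataFrame-based sketch, assuming the spark-avro module is on the classpath (the input path is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The reader takes the schema from the Avro files themselves.
    df = spark.read.format("avro").load("/data/events.avro")
    df.show()
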
16 votes, 3 answers

Avro schema definition nesting types

I am fairly new to Avro and going through documentation for nested types. I have the example below working nicely but many different types within the model will have addresses. Is it possible to define an address.avsc file and reference that as a…
derdc
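
One common approach, sketched with illustrative names: once a named record such as example.Address is defined, later fields can refer to it by name instead of repeating the definition. In Java, a single Schema.Parser instance can likewise parse address.avsc first and then a schema that references it by full name.

    {
      "type": "record",
      "name": "Customer",
      "namespace": "example",
      "fields": [
        {"name": "home", "type": {
          "type": "record",
          "name": "Address",
          "fields": [{"name": "street", "type": "string"}]
        }},
        {"name": "work", "type": "example.Address"}
      ]
    }
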
16 votes, 2 answers

How to define avro schema for complex json document?

I have a JSON document that I would like to convert to Avro and need a schema to be specified for that purpose. Here is the JSON document for which I would like to define the avro schema: { "uid": 29153333, "somefield": "somevalue", "options": [ …
user2727704
15 votes, 2 answers

Use kafka-avro-console-producer with a schema already in the schema registry

I would like to use the kafka-avro-console-producer with the schema registry. I have big schemas (over 10k chars) and I can't really paste them as a command line argument. Besides that, I'd like to use the schema registry directly so I can use a…
0x26res
15 votes, 2 answers

Python AVRO reader returns AssertionError when decoding kafka messages

Newbie playing with Kafka and Avro. I am trying to deserialise Avro messages in Python 3.7.3 using the kafka-python and avro-python3 packages, following this answer. The function responsible for decoding the Kafka messages is def…
Mattia Paterna
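
A frequent cause of such failures (assumed here, since the question is truncated) is that messages produced with Confluent serializers are not plain Avro: they carry a five-byte prefix of one magic byte plus a four-byte schema ID ahead of the Avro body. A sketch of unwrapping that framing by hand:

    import io
    import struct

    import avro.schema
    from avro.io import DatumReader, BinaryDecoder

    def decode_confluent(raw_bytes, schema):
        # Confluent wire format: magic byte (0), 4-byte big-endian schema ID,
        # then the Avro-encoded body.
        magic, schema_id = struct.unpack(">bI", raw_bytes[:5])
        assert magic == 0, "not Confluent-framed Avro"
        return DatumReader(schema).read(BinaryDecoder(io.BytesIO(raw_bytes[5:])))
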
15 votes, 4 answers

Getting Started with Avro

I want to get started with using Avro with MapReduce. Can someone suggest a good tutorial or example to get started with? I couldn't find much through internet search.
Sri
15 votes, 1 answer

Unnesting in SQL (Athena): How to convert array of structs into an array of values plucked from the structs?

I am taking samples from a Bayesian statistical model, serializing them with Avro, uploading them to S3, and querying them with Athena. I need help writing a query that unnests an array in the table. The CREATE TABLE query looks like: CREATE…
Count Zero
15 votes, 2 answers

Storing null values in avro files

I have some json data that looks like this: { "id": 1998983092, "name": "Test Name 1", "type": "search string", "creationDate": "2017-06-06T13:49:15.091+0000", "lastModificationDate": "2017-06-28T14:53:19.698+0000", …
mba12
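
In Avro, a field that may be null is declared as a union with the null type, with null listed first so it can serve as the default; a sketch (field name taken from the question's JSON):

    {"name": "lastModificationDate", "type": ["null", "string"], "default": null}
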
15 votes, 1 answer

module 'avro.schema' has no attribute 'parse'

I am new to Python and I was trying to write a simple piece of code for converting a text file to Avro. I am getting this error that the module was not found. I can clearly see in the schema.py file that the parse module exists. I will appreciate it if someone could…
Kevin K
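
The usual cause: the avro-python3 distribution capitalizes the function as Parse, while the avro package spells it parse. A small sketch that tolerates both (the .avsc file name is illustrative):

    import avro.schema

    try:
        parse = avro.schema.parse   # the avro package
    except AttributeError:
        parse = avro.schema.Parse   # the avro-python3 package

    schema = parse(open("user.avsc").read())
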
15 votes, 2 answers

KafkaAvroDeserializer does not return SpecificRecord but returns GenericRecord

My KafkaProducer is able to use KafkaAvroSerializer to serialize objects to my topic. However, KafkaConsumer.poll() returns deserialized GenericRecord instead of my serialized class. MyKafkaProducer KafkaProducer producer; …
Glide
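
The consumer-side switch for this is the specific.avro.reader property: when it is left unset, KafkaAvroDeserializer falls back to GenericRecord. Shown as a bare consumer property (surrounding configuration omitted):

    specific.avro.reader=true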