Questions tagged [avro]

Apache Avro is a data serialization framework primarily used in Apache Hadoop.

Apache Avro is a data serialization system.

Features:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.
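As a sketch of such schema resolution, suppose data was written with a first schema version and is later read with a second version that adds a field. Because both schemas are available, the reader fills the new field from its declared default (record and field names here are illustrative, not from any real project):

```json
{"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"}
]}

{"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
]}
```

Fields present in the writer's schema but absent from the reader's are simply skipped, and fields new to the reader's schema take their defaults.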

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.
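An Avro RPC service is declared as a protocol, itself a JSON document listing named types and messages. A minimal sketch, loosely modeled on the greeting example in the Avro specification (the protocol, namespace, and message names are illustrative):

```json
{
  "protocol": "HelloWorld",
  "namespace": "org.example",
  "types": [
    {"type": "record", "name": "Greeting",
     "fields": [{"name": "message", "type": "string"}]}
  ],
  "messages": {
    "hello": {
      "request": [{"name": "greeting", "type": "Greeting"}],
      "response": "Greeting"
    }
  }
}
```

It is this protocol document (or its hash, once cached) that the two sides exchange in the handshake.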

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
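For example, a record schema with one required field and one optional field (expressed as a union with "null") looks like the following; all names are illustrative:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"], "default": null}
  ]
}
```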

Comparison with other systems:

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
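To illustrate how compact the untagged encoding is: per the Avro specification, int and long values are zigzag-coded and then written as variable-length (varint) bytes, so small magnitudes take a single byte and no field tag is written at all. A minimal pure-Python sketch of that encoding (not the official library):

```python
def encode_long(n: int) -> bytes:
    """Encode an integer the way Avro encodes int/long values:
    zigzag coding (maps small positive and negative numbers to
    small unsigned numbers), then 7-bit varint bytes."""
    zz = (n << 1) ^ (n >> 63)  # zigzag: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4
    zz &= (1 << 64) - 1        # keep the value in unsigned 64-bit range
    out = bytearray()
    while True:
        byte = zz & 0x7F
        zz >>= 7
        if zz:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Small magnitudes fit in one byte; no per-field tag is emitted.
print(encode_long(0).hex())    # 00
print(encode_long(-1).hex())   # 01
print(encode_long(1).hex())    # 02
print(encode_long(64).hex())   # 8001
```

Contrast this with tagged formats, where every field carries an identifying tag byte (or more) in addition to its value.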

Official Website: http://avro.apache.org/

3646 questions
15 votes · 7 answers

JsonMappingException when serializing avro generated object to json

I used avro-tools to generate java classes from avsc files, using: java.exe -jar avro-tools-1.7.7.jar compile -string schema myfile.avsc Then I tried to serialize such objects to json by ObjectMapper, but always got a JsonMappingException saying…
Amir · 391
15 votes · 1 answer

Avro Schema. How to set type to "record" and "null" at once

I need to mix "record" type with null type in Schema. "name":"specShape", "type":{ "type":"record", "name":"noSpecShape", "fields":[ { "name":"bpSsc", …
Nadir Novruzov · 467
15 votes · 2 answers

How to convert from GenericRecord to SpecificRecord in Avro for compatible schemas

Is the Avro SpecificRecord (i.e. the generated java classes) compatible with schema evolution? I.e. if I have a source of Avro messages (in my case, kafka) and I want to deserialize those messages to a specificrecord, is it possible to do…
Mark D · 5,368
15 votes · 6 answers

Apache Avro: map uses CharSequence as key

I am using Apache Avro. My schema has map type: {"name": "MyData", "type" : {"type": "map", "values":{ "type": "record", "name": "Person", "fields":[ …
Mellon · 37,586
14 votes · 1 answer

Avro ENUM field

I am trying to create Union field in Avro schema and send corresponding JSON message with it but to have one of the fields - null. https://avro.apache.org/docs/1.8.2/spec.html#Unions What is example of simplest UNION type (avro schema) with…
user9750148 · 375
14 votes · 1 answer

using AWS Glue with Apache Avro on schema changes

I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but am struggling through the following use case: We have an s3 bucket with a number of Avro files. We have decided to use Avro due to having extensive support for data…
CharStar · 427
14 votes · 3 answers

Where is an Avro schema stored when I create a hive table with 'STORED AS AVRO' clause?

There are at least two different ways of creating a hive table backed with Avro data: Creating a table based on an Avro schema (in this example, stored in hdfs): CREATE TABLE users_from_avro_schema ROW FORMAT SERDE…
tomek · 771
14 votes · 4 answers

Use schema to convert AVRO messages with Spark to DataFrame

Is there a way to use a schema to convert avro messages from kafka with spark to dataframe? The schema file for user records: { "fields": [ { "name": "firstName", "type": "string" }, { "name": "lastName", "type": "string" } ], "name":…
Sascha Vetter · 2,466
14 votes · 1 answer

Avro: deserialize json - schema with optional fields

There are a lot of questions and answers on stackoverflow on the subject, but no one that helps. I have a schema with optional value: { "type" : "record", "name" : "UserSessionEvent", "namespace" : "events", "fields" : [ { "name" :…
Pavel Bernshtam · 4,232
14 votes · 3 answers

How to read and write Map from/to parquet file in Java or Scala?

Looking for a concise example on how to read and write Map from/to parquet file in Java or Scala? Here is expected structure, using com.fasterxml.jackson.databind.ObjectMapper as serializer in Java (i.e. looking for equivalent using…
okigan · 1,559
14 votes · 3 answers

Does Avro schema evolution require access to both old and new schemas?

If I serialize an object using a schema version 1, and later update the schema to version 2 (say by adding a field) - am I required to use schema version 1 when later deserializing the object? Ideally I would like to just use schema version 2 and…
bils · 143
13 votes · 3 answers

Avro Java API Timestamp Logical Type?

With the Avro Java API, I can make a simple record schema like: Schema schemaWithTimestamp = SchemaBuilder .record("MyRecord").namespace("org.demo") .fields() .name("timestamp").type().longType().noDefault() …
clay · 18,138
13 votes · 6 answers

Start Confluent Schema Registry in windows

I have windows environment and my own set of kafka and zookeeper running. To use custom objects, I started to use Avro. But I needed to get the registry started. Downloaded Confluent platform and ran this: $ ./bin/schema-registry-start…
13 votes · 3 answers

Avro schema doesn't honor backward compatibility

I have this avro schema { "namespace": "xx.xxxx.xxxxx.xxxxx", "type": "record", "name": "MyPayLoad", "fields": [ {"name": "filed1", "type": "string"}, {"name": "filed2", "type": "long"}, {"name": "filed3", "type":…
Raghvendra Singh · 1,775
13 votes · 2 answers

How to generate schema-less avro files using apache avro?

I am using Apache avro for data serialization. Since, the data has a fixed schema I do not want the schema to be a part of serialized data. In the following example, schema is a part of the avro file "users.avro". User user1 = new…
mintra · 317