Questions tagged [avro]

Apache Avro is a data serialization system, primarily used in the Apache Hadoop ecosystem.

Features:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.
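
For instance, writing a container file with the Java GenericRecord API stores the writer's schema once in the file header; a minimal sketch, assuming an invented User schema and file name:

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import java.io.File;

    public class WriteUsers {
        public static void main(String[] args) throws Exception {
            // The writer's schema is stored once in the file header, so each
            // datum is encoded with no per-value type information.
            Schema schema = new Schema.Parser().parse(
                "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
              + "{\"name\": \"name\", \"type\": \"string\"},"
              + "{\"name\": \"age\", \"type\": \"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Ada");
            user.put("age", 36);

            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }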

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.
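
Schema resolution in practice: the same file can be read back with a reader schema that adds an optional field; a sketch continuing the example above (the email field is hypothetical):

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import java.io.File;

    public class ReadUsers {
        public static void main(String[] args) throws Exception {
            // The writer's schema comes from the file header; the reader's
            // schema below is resolved against it by field name, so the
            // missing "email" field falls back to its default.
            Schema readerSchema = new Schema.Parser().parse(
                "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
              + "{\"name\": \"name\", \"type\": \"string\"},"
              + "{\"name\": \"age\", \"type\": \"int\"},"
              + "{\"name\": \"email\", \"type\": [\"null\", \"string\"], \"default\": null}]}");

            GenericDatumReader<GenericRecord> datumReader =
                new GenericDatumReader<>(readerSchema);
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(new File("users.avro"), datumReader)) {
                for (GenericRecord r : reader) {
                    System.out.println(r.get("name") + " / " + r.get("email")); // email -> null
                }
            }
        }
    }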

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
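
For example, the User record used in the sketches above, written out as a standalone .avsc file (an invented schema, for illustration only):

    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "age",  "type": "int"}
      ]
    }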

Comparison with other systems:

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages (see the sketch after this list).
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
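
A sketch of the dynamic-typing point: any Avro data file can be opened generically and processed without generated classes, since the embedded schema drives decoding (the file name is hypothetical):

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import java.io.File;

    public class InspectAvro {
        public static void main(String[] args) throws Exception {
            // No reader schema supplied: decoding is driven entirely by the
            // schema stored in the file itself.
            try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                     new File("unknown.avro"), new GenericDatumReader<GenericRecord>())) {
                System.out.println(reader.getSchema().toString(true)); // pretty-print the schema
                for (GenericRecord record : reader) {
                    System.out.println(record);
                }
            }
        }
    }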


Official Website: http://avro.apache.org/

3646 questions
1 vote, 1 answer

Unable to cast Generic record to object

Hi, I have the below model in C#. These are Avro files. public ProductMessageEvents GetValue(TestPayload parameter) { return new ProductMessageEvents() { Metadata = new KafkaProductEvent.Metadata { …
Niranjan • 1,881 • 6 • 44 • 71

1 vote, 1 answer

How to handle deserializing avro map type in C#

Working on a project using the .NET Confluent Kafka client, getting an exception when deserializing an Avro map type. Is there a way of doing this in C#? The project is producing and consuming messages from Kafka. There are no issues around consuming…
1 vote, 0 answers

OpenAPI / JDL and Avro schema: what is the best practice to generate entities/object definitions?

I am building a microservice piece of software with the API-first approach, I would like to re-use some of the entities in some other microservices. I can generate the object definitions at 3 different places: DTO from the open-api, Entities from…
Laurent • 31 • 4

1 vote, 0 answers

How to keep FULL_TRANSITIVE compatibility while adding new types to nested map in avro schema?

I have an existing avro schema that contains a field with a nested map of map of a record type (let's call it RecordA for now). I'm wondering if it's possible to add a new record type, RecordB, to this nested map of maps while maintaining…
sstannus • 23 • 3

1 vote, 1 answer

How to Fix Cached Schema Registry lookup causing terrible performance

Edit: I discovered this other question from a few years back (How to populate the cache in CachedSchemaRegistryClient without making a call to register a new schema?). It mentions that the CachedSchemaRegistryClient needs to register the schema to…
Jicaar • 1,044 • 10 • 26

1 vote, 1 answer

Do avro and parquet formatted data have to be written within a hadoop infrastructure?

I've been researching the pros and cons of using avro, parquet, and other data sources for a project. If I am receiving input data from other groups of people who do not operate using Hadoop, will they be able to provide this input data in…
Jonathan Myers • 930 • 6 • 17

1 vote, 1 answer

Recursive schema with avro (SchemaBuilder)

Is it possible to make an avro schema which is recursive, like Schema schema = SchemaBuilder .record("RecursiveItem") .namespace("com.example") .fields() .name("subItem") .type("RecursiveItem") .withDefault(null) // not sure…
Juh_ • 14,628 • 8 • 59 • 92
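
For context, a recursive record can be declared in plain JSON by referencing the record's own name, with the self-reference wrapped in a union so that a null default is legal; a sketch of one approach (via Schema.Parser rather than the SchemaBuilder the question asks about):

    import org.apache.avro.Schema;

    public class RecursiveSchema {
        public static void main(String[] args) {
            // "RecursiveItem" may reference itself by name once declared; the
            // union with "null" is what makes the null default legal.
            Schema schema = new Schema.Parser().parse(
                "{\"type\": \"record\", \"name\": \"RecursiveItem\","
              + " \"namespace\": \"com.example\", \"fields\": ["
              + "{\"name\": \"subItem\", \"type\": [\"null\", \"RecursiveItem\"], \"default\": null}]}");
            System.out.println(schema.toString(true));
        }
    }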

1 vote, 2 answers

IllegalAccessError: tried to access method org.apache.avro.specific.SpecificData.<init>()

Using Avro for serializing data to byte[] and deserializing data. https://cwiki.apache.org/confluence/display/AVRO/FAQ#FAQ-HowcanIserializedirectlyto/fromabytearray? shows sample usage. SpecificDatumReader reader = new…
pc70 • 681 • 1 • 14 • 28
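
For context, the byte-array round trip from the FAQ the question cites looks roughly like this with the generic API (a sketch; the schema and record are assumed to come from elsewhere):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class ByteArrayRoundTrip {
        // Serialize one record to raw Avro binary (no container-file header).
        static byte[] toBytes(Schema schema, GenericRecord record) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
            encoder.flush();
            return out.toByteArray();
        }

        // Deserialize; raw binary carries no schema, so the caller supplies it.
        static GenericRecord fromBytes(Schema schema, byte[] bytes) throws Exception {
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
            return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        }
    }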

1 vote, 2 answers

Avro - java.io.IOException: Not a data file

I am using https://github.com/allegro/json-avro-converter to convert my json message into an avro file. After calling the convertToAvro method I get a byte array: byte[] byteArrayJson. Then I am using the commons library from…
agata • 481 • 2 • 9 • 29

1 vote, 2 answers

Spark 2.4.1 can not read Avro file from HDFS

I have a simple code block to write and then read a dataframe in Avro format. As the Avro lib is already built into Spark 2.4.x, writing the Avro files succeeded and the files are generated in HDFS. However, an AbstractMethodError exception is thrown when I read…
Martin Peng • 87 • 1 • 9
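
For reference, in Spark 2.4.x the Avro source lives in the external spark-avro module, which has to match the Spark and Scala versions on the classpath; a minimal sketch (package coordinates and path are illustrative):

    // e.g. spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.1 ...
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadAvroFromHdfs {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("read-avro").getOrCreate();
            // An AbstractMethodError on read commonly points at a spark-avro
            // build that does not match the running Spark version.
            Dataset<Row> df = spark.read().format("avro").load("hdfs:///data/example.avro");
            df.show();
        }
    }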

1 vote, 1 answer

Use Kafka Streams with Avro Schema Registry

I'm seeking kafka-streams usage with schema-registry. I have googled and couldn't find a proper tutorial.
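
A minimal configuration sketch, assuming Confluent's GenericAvroSerde from the kafka-streams-avro-serde artifact (broker and registry URLs are placeholders):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;
    import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde;

    public class StreamsAvroConfig {
        public static Properties props() {
            Properties p = new Properties();
            p.put(StreamsConfig.APPLICATION_ID_CONFIG, "avro-streams-app");
            p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Serde that fetches and registers schemas via the Schema Registry.
            p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, GenericAvroSerde.class);
            // Property name the Confluent Avro serdes read the registry URL from.
            p.put("schema.registry.url", "http://localhost:8081");
            return p;
        }
    }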

1 vote, 1 answer

Is there another/similar method for spark.read.format.load outside of Databricks?

I am trying to load an avro file into a Spark dataframe so I can convert it to a pandas dataframe and eventually a dictionary. The method I want to use: df = spark.read.format("avro").load(avro_file_in_memory) (Note: the avro file data I'm trying to load…
Fastas • 79 • 1 • 8

1 vote, 2 answers

Generate classes with decimal datatype with avro-maven-plugin 1.9.0

I have an Avro schema with some fields defined as decimal logical type. From that schema I generate classes using avro-maven-plugin (using version 1.9.0). I would like to avoid generating ByteBuffer type and use BigDecimal instead. I found in older…
traczovsky • 358 • 4 • 10
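
For what it's worth, avro-maven-plugin exposes an enableDecimalLogicalType switch that makes the generated classes use BigDecimal rather than ByteBuffer for decimal logical types; a pom.xml sketch:

    <plugin>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-maven-plugin</artifactId>
      <version>1.9.0</version>
      <configuration>
        <!-- generate BigDecimal rather than ByteBuffer for decimal fields -->
        <enableDecimalLogicalType>true</enableDecimalLogicalType>
      </configuration>
    </plugin>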

1 vote, 1 answer

How to create a table from avro schema (.avsc)?

I have an avro schema file and I need to create a table in Databricks through pyspark. I don't need to load the data, just want to create the table. The easy way is to load the JSON string and take the "name" and "type" from fields array. Then…
Anirban Nag 'tintinmj' • 5,572 • 6 • 39 • 59

1 vote, 1 answer

Using the KafkaAvroDeserializer with Alpakka

I have a SchemaRegistry and a KafkaBroker from which I pull data with Avro v1.8.1. For deserialization I've been using Confluent's KafkaAvroDeserializer. Now I've meant to refactor my code in order to use the Elasticsearch API provided by Alpakka,…
styps • 279 • 2 • 14