Questions tagged [avro]

Apache Avro is a data serialization framework primarily used in Apache Hadoop.

Apache Avro is a data serialization system.

Features:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
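For example, a record schema is itself an ordinary JSON document (the names here are illustrative):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]}
  ]
}
```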

Comparison with other systems:

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.


Official Website: http://avro.apache.org/

3646 questions
1 vote, 0 answers

avro schema with json encoding - how to determine schema back from serialized data

I want to use Apache Avro schemas for data serialization and deserialization. I want to use it with JSON encoding. I want to put several of these serialized objects, using different schemas, onto the same "source" (it's a Kafka topic). When I read it…
snap
1 vote, 1 answer

java.lang.InstantiationException while deserializing a byte stream into a Scala case class object

I am trying to deserialize an Avro byte stream into a Scala case class object. Basically, I had a Kafka stream with Avro-encoded data flowing, and now there is an addition to the schema and I am trying to update the Scala case class to include the…
Amit Arora
1 vote, 1 answer

AVRO, convert record to array

My AVRO schema has a fileObject record but I need to change this to be an array of fileObject. How can I do this? { "name": "file", "type": ["null", { "type": "record", "name":…
mrmannione
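For reference, wrapping a record type in an array follows the shape below. The question's record body is elided, so a single placeholder field (`path`) is shown here purely for illustration:

```json
{
  "name": "file",
  "type": ["null", {
    "type": "array",
    "items": {
      "type": "record",
      "name": "fileObject",
      "fields": [
        {"name": "path", "type": "string"}
      ]
    }
  }],
  "default": null
}
```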
1 vote, 1 answer

Confluent - Splitting Avro messages from one kafka topic into multiple kafka topics

We have an incoming kafka topic with multiple Avro schema based messages serialized into it. We need to split the messages in Avro format into multiple other kafka topics based on certain value of a common schema attribute. …
1 vote, 1 answer

Confluent Schema Registry - Error while forwarding register schema request

I get the below error on issuing this -> curl -X POST -i -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema": "{\"type\": \"string\"}"}' http://127.0.0.1:8081/subjects/Kafka-key/versions to my confluent kafka client. Same…
1 vote, 1 answer

Avro schema Java deepcopy issue with field order

I am currently looking into solutions for unexpected behaviour when dealing with particular AVRO schema evolution scenarios when using Java and doing a deepcopy in a consumer to parse the GenericRecord class into a specific class which was generated…
Oskar
1 vote, 1 answer

Serialize an avro object to string in python

In Python 3.7, I want to encode an Avro object to String. I found examples converting to byte array but not to string. Code to convert to byte array: def serialize(mapper, schema): bytes_writer = io.BytesIO() encoder =…
Sarthak Agrawal
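For the general bytes-to-string step (the Avro-specific encoding is elided in the excerpt above), base64 is a common choice, since Avro's binary output is not valid UTF-8 in general. A stdlib-only sketch with placeholder bytes:

```python
import base64

# Placeholder: pretend these bytes came from an Avro BinaryEncoder
# writing into an io.BytesIO buffer.
avro_bytes = b"\x02\x0eexample"

# Encode to a printable string for transport, then recover the bytes.
as_text = base64.b64encode(avro_bytes).decode("ascii")
round_trip = base64.b64decode(as_text)

assert round_trip == avro_bytes
print(as_text)
```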
1 vote, 1 answer

Ingesting CSV data into Hive using NiFi

I am trying to ingest CSV data into a Hive database. For this purpose, I tried ListFile --> FetchFile --> ConvertCSVToAvro --> ConvertAvroToOrc --> PutHDFS. The CSV data is converted into ORC format and the data is loaded into HDFS. On top of this HDFS…
user6325753
1 vote, 0 answers

AvroIO fails with java.lang.ClassCastException when reading logicalType date

I am using AvroIO from Apache Beam with the Spark runner. I have defined an Avro record with field { "name" : "serviceDate", "type" : [ "null", { "type" : "int", "logicalType" : "date" } ], …
Anuj J
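Per the Avro specification, a nullable date field is declared as a union of null and an int carrying the date logical type, as in the excerpt above (the default shown here is an assumption added for completeness):

```json
{"name": "serviceDate",
 "type": ["null", {"type": "int", "logicalType": "date"}],
 "default": null}
```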
1 vote, 1 answer

Setting attributes to null using a config file set by the user

I am working on a project where I was asked to make it so that the user can enter attributes that they want obscured from a .yml config file. The attribute already has a value in it and it will be set to null based on whether the user wants that…
NickS
1 vote, 1 answer

count(*) on Avro table returns 0

I've recently moved to using AvroSerDe for my external tables in Hive. Select col_name,count(*) from table group by col_name; The above query gives me a count, whereas the below query does not: Select count(*) from table;
YSI
1 vote, 2 answers

avro-python3 doesn't provide schema evolution?

I am trying to recreate a schema evolution case with avro-python3 (backward compatibility). I have two schemas: import avro.schema from avro.datafile import DataFileReader, DataFileWriter from avro.io import DatumReader, DatumWriter schema_v1 =…
Eugene
1 vote, 1 answer

Kafka Connect ignoring the Subject Strategies specified

I want to publish data from multiple tables onto the same Kafka topic using the below connector config, but I am seeing the below exception. Exception Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema being registered…
1 vote, 1 answer

Spring Cloud Stream Kafka with confluent schema registry failed to send Message to channel output

I would like to test Spring Cloud Stream with Confluent Schema Registry and Avro Schema evolution to integrate it with my application. I have found out that Spring Cloud Stream does not support a secure connection to Confluent Schema Registry and…
1 vote, 1 answer

How to get the stream from a Kafka topic to Elasticsearch with Confluent?

I'm reading data from a machine and streaming it as JSON to a Kafka topic. I would like to read this topic and store the stream data in Elasticsearch with Confluent. My steps: 1. Create KSQL streams to convert from JSON to AVRO. json stream: CREATE STREAM…