Questions tagged [spark-avro]

A library for reading and writing Avro data from Spark SQL.

The library is hosted on GitHub.

227 questions
2 votes, 1 answer

Missing Avro Custom Header when using Spark SQL Streaming

Before sending an Avro GenericRecord to Kafka, a header is inserted like so: ProducerRecord record = new ProducerRecord<>(topicName, key, message); record.headers().add("schema", schema); Consuming the record: when using Spark…
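The excerpt above attaches the Avro schema as a Kafka record header on the producer side. A minimal plain-Python sketch of the same idea (no broker or client library; `attach_schema_header` and the `Event` schema are hypothetical, used only to illustrate that Kafka headers are name/bytes pairs):

```python
import json

def attach_schema_header(topic, key, value, schema_json):
    """Bundle a record with a 'schema' header, mirroring the Java
    record.headers().add("schema", schema) pattern from the question.
    Kafka represents headers as (name, bytes) pairs."""
    headers = [("schema", json.dumps(schema_json).encode("utf-8"))]
    return {"topic": topic, "key": key, "value": value, "headers": headers}

record = attach_schema_header(
    "events", b"k1", b"avro-payload",
    {"type": "record", "name": "Event", "fields": []},
)
```

Worth noting for the "missing header" symptom: as far as I can tell, Spark's Kafka source only started surfacing record headers in Spark 3.0 (via the `includeHeaders` read option), so a Spark 2.x structured-streaming consumer would silently drop them.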
2 votes, 2 answers

FileNotFoundException: Spark save fails. Cannot clear cache from Dataset[T] avro

I get the following error when saving a DataFrame in Avro for a second time. If I delete sub_folder/part-00000-XXX-c000.avro after saving, and then try to save the same dataset, I get the following: FileNotFoundException: File…
2 votes, 1 answer

org.apache.avro.AvroTypeException: Expected record-start. Got VALUE_STRING

I am doing a simple JSON to Avro record conversion, but I am getting this issue. I have tried many approaches, applying more than 15 solutions from Stack Overflow and elsewhere. My file looks like this: { "namespace": "test", "type": "record", "name":…
Sun
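The "Expected record-start. Got VALUE_STRING" error typically means Avro's JSON decoder expected a JSON object (the start of a record) but found a bare string instead, e.g. a quoted payload, or a string where the schema declares a nested record. A stdlib-only sketch of the distinction (the `Person` schema and `matches_record` helper are hypothetical):

```python
import json

# A record schema expects a JSON *object*; a bare JSON string is what
# produces "Expected record-start. Got VALUE_STRING".
schema = {
    "namespace": "test",
    "type": "record",
    "name": "Person",
    "fields": [{"name": "name", "type": "string"}],
}

good_payload = json.loads('{"name": "Ada"}')  # object -> record-start
bad_payload = json.loads('"Ada"')             # bare string -> VALUE_STRING

def matches_record(payload, schema):
    """Cheap shape check: a record payload must be a JSON object
    containing every declared field name."""
    return isinstance(payload, dict) and all(
        f["name"] in payload for f in schema["fields"]
    )
```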
2 votes, 1 answer

DataFrameReader throwing "Unsupported type NULL" while reading avro file

I am trying to read an Avro file with DataFrameReader, but I keep getting: org.apache.spark.sql.avro.IncompatibleSchemaException: Unsupported type NULL Since I am going to deploy it on Dataproc I am using Spark 2.4.0, but the same happened when I tried…
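"Unsupported type NULL" usually points at a field whose Avro schema is the bare "null" type (some writers emit this for columns that contained no values), which spark-avro cannot map to a Spark type. The common fix is to widen such fields to a nullable union like ["null", "string"]. A stdlib sketch of that rewrite (the `Example` schema and `widen_null_fields` helper are hypothetical):

```python
# Hypothetical schema exhibiting the problem: a bare "null" field type.
schema = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "note", "type": "null"},  # bare NULL -> IncompatibleSchemaException
    ],
}

def widen_null_fields(schema, fallback="string"):
    """Replace bare 'null' field types with a nullable union so a
    reader such as spark-avro can map them to a concrete type."""
    for field in schema["fields"]:
        if field["type"] == "null":
            field["type"] = ["null", fallback]
            field.setdefault("default", None)
    return schema

patched = widen_null_fields(schema)
```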
2 votes, 0 answers

How to make streaming query with select over avro struct faster?

I'm using Spark structured streaming with a Kafka streaming source and Avro format, and the creation of the DataFrame is very slow! To measure the streaming query I have to add an action in order to evaluate the DAG and time it. If I…
ggeop
2 votes, 0 answers

Default schema value conversion fails in to_avro() while publishing data to Kafka using databricks spark-avro

I am trying to publish data to a Kafka topic using the Confluent Schema Registry. The following is my schema registry registration: schemaRegistryClient.register("primitive_type_str_avsc", new Schema.Parser().parse( s""" |{ | "type": "record", | "name":…
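A frequent cause of default-value failures in schemas like the one above is the Avro rule that a union field's default is validated against the *first* branch of the union: a null default therefore requires "null" to be listed first. A stdlib sketch of that check (the `PrimitiveTypeStr` schema here is a hypothetical stand-in for the truncated one in the question):

```python
import json

# Hypothetical record schema with a defaulted, nullable field.
schema = json.loads("""
{
  "type": "record",
  "name": "PrimitiveTypeStr",
  "fields": [
    {"name": "value", "type": ["null", "string"], "default": null}
  ]
}
""")

def null_default_is_legal(field):
    """Avro resolves a field default against the FIRST branch of a
    union, so a literal null default demands ["null", ...] ordering."""
    if field.get("default", object()) is None:  # default is literally null
        return isinstance(field["type"], list) and field["type"][0] == "null"
    return True
```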
2 votes, 1 answer

Hive External table on AVRO file producing only NULL data for all columns

I am trying to create a Hive external table on top of some Avro files generated using Spark/Scala. I am using CDH 5.16, which has Hive 1.1 and Spark 1.6. Creating the Hive external table ran successfully, but when I query the data I am…
Vaishak
2 votes, 0 answers

How to include external avro packages for Spark 2.4 into Junit?

I could have asked: how can I avoid the error "Avro is built-in but external data source module since Spark 2.4"? I have been using the following approach to bootstrap my session in JUnit (this approach works for all my other tests). sparkSession =…
hba
2 votes, 0 answers

Block size invalid or too large - Failed to read Avro files

I'm using Spark and Scala, trying to read Avro folders using com.databricks:spark-avro_2.11. All the folders were read successfully except for one, which failed with the following exception (attached). I checked the files manually,…
Ben Haim Shani
2 votes, 1 answer

How to write a pyspark-dataframe to redshift?

I am trying to write a pyspark DataFrame to Redshift, but it fails with the error: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated Caused…
murtaza1983
2 votes, 1 answer

How can I set a logicalType in a spark-avro 2.4 schema?

We read timestamp information from avro files in our application. I am in the process of testing an upgrade from Spark 2.3.1 to Spark 2.4 which includes the newly built-in spark-avro integration. However, I cannot figure out how to tell the avro…
Matt Ford
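For context on the logicalType question: in the Avro specification, a logical type is an annotation layered on top of an underlying primitive, so timestamp-millis rides on a long. A stdlib sketch of what such a schema looks like as JSON (the `Event` record name and `ts` field are hypothetical; whether spark-avro 2.4 honors the annotation on write is exactly the question being asked):

```python
import json

# Avro schema with a logical type: timestamp-millis annotates a long.
schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {
            "name": "ts",
            "type": {"type": "long", "logicalType": "timestamp-millis"},
        }
    ],
}

# Round-trip through JSON, as the schema would be written to an .avsc file.
parsed = json.loads(json.dumps(schema))
```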
2 votes, 3 answers

Spark reading Avro file

I'm using com.databricks.spark.avro. When I run it from spark-shell like so: spark-shell --jar spark-avro_2.11-4.0.0.jar, I am able to read the file by doing this: import org.apache.spark.sql.SQLContext val sqlContext = new SQLContext(sc) val…
covfefe
2 votes, 1 answer

Converting StructType to Avro Schema, returns type as Union when using databricks spark-avro

I am using databricks spark-avro to convert a DataFrame schema into an Avro schema. The returned Avro schema fails to have a default value. This is causing issues when I am trying to create a GenericRecord out of the schema. Can anyone help with the…
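What the question describes is the usual shape of the conversion: SchemaConverters maps a nullable StructType column to a ["null", T] union without a default, and GenericRecordBuilder then complains about unset fields. Since Avro validates a default against the first union branch, the only legal default for such a field is null. A stdlib sketch of patching that in (the `city` field and `add_null_default` helper are hypothetical):

```python
# What a nullable column typically converts to: a union, but no default.
field = {"name": "city", "type": ["null", "string"]}

def add_null_default(field):
    """For a ["null", T] union the only legal default is null, because
    Avro checks the default against the first branch of the union."""
    if isinstance(field["type"], list) and field["type"][0] == "null":
        field.setdefault("default", None)
    return field

patched = add_null_default(dict(field))
```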
2 votes, 2 answers

Schema in Avro message

I see that Avro messages have the schema embedded, and then the data in binary format. If multiple messages are sent and a new Avro file is created for every message, isn't the schema embedding an overhead? So does that mean it is always…
Roshan Fernando
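On the overhead concern above: an Avro Object Container File stores the schema once per file, amortized over all records in it, so embedding only hurts when files hold a single message. Per-message pipelines instead avoid re-sending the schema by prefixing each message with a 5-byte reference, the Confluent wire format: one magic byte (0) plus a 4-byte big-endian schema-registry id. A stdlib sketch of that prefix:

```python
import struct

def wire_format_prefix(schema_id):
    """Confluent wire format: 1 magic byte (0) + 4-byte big-endian
    schema id.  Each message carries this 5-byte reference instead
    of the full schema text."""
    return struct.pack(">bI", 0, schema_id)

prefix = wire_format_prefix(42)
```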
2 votes, 0 answers

Read/access primitive double array from parquet using Spark using Java api

I have a Parquet file generated using the parquet-avro library, where one of the fields holds a primitive double array, created using the following schema type: Schema.createArray(Schema.create(Schema.Type.DOUBLE)) I read this parquet data from Spark…
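For reference, the Java builder call in the question corresponds to a plain JSON array schema. A stdlib sketch showing that equivalence (the surrounding `Measurements` record and `values` field are hypothetical, added only to show the array embedded in a record):

```python
import json

# JSON equivalent of Schema.createArray(Schema.create(Schema.Type.DOUBLE))
array_schema = {"type": "array", "items": "double"}

# Embedded in a hypothetical record field, as it would appear in an .avsc:
record_schema = {
    "type": "record",
    "name": "Measurements",
    "fields": [{"name": "values", "type": array_schema}],
}

parsed = json.loads(json.dumps(record_schema))
```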