Questions tagged [spark-avro]

A library for reading and writing Avro data from Spark SQL.

The GitHub page is here.

227 questions
1 vote · 1 answer

Avro backward compatibility doesn't work as expected

I have two Avro schemas, V1 and V2, which are read in Spark as below: import org.apache.spark.sql.avro.functions._ val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/V1.avsc"))) val df = spark .readStream …
darkknight444 · 546 · 8 · 21
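Avro's backward-compatibility rule boils down to schema resolution: a reader (V2) may add fields over the writer (V1) only if those fields carry a default. A minimal pure-Python sketch of that rule, using hypothetical V1/V2 schemas (this is not Spark code, just the resolution logic):

```python
import json

# Hypothetical V1 (writer) and V2 (reader) schemas: V2 adds a field
# with a default, which Avro requires for backward compatibility.
v1 = json.loads("""{
  "type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]}""")
v2 = json.loads("""{
  "type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]}""")

def resolve(record, writer, reader):
    """Simplified Avro schema resolution: keep fields the reader knows,
    fill reader-only fields from their defaults (error if none)."""
    writer_names = {f["name"] for f in writer["fields"]}
    out = {}
    for f in reader["fields"]:
        if f["name"] in writer_names:
            out[f["name"]] = record[f["name"]]
        elif "default" in f:
            out[f["name"]] = f["default"]
        else:
            raise ValueError(f"no default for new field {f['name']!r}")
    return out

print(resolve({"id": 1, "name": "a"}, v1, v2))
# a V1 record read with the V2 schema gains email=None
```

If the new V2 field had no default, resolution would fail — which is the usual reason "backward compatible" schemas don't behave as expected.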
1 vote · 0 answers

Avro vs ORC memory consumption/performance in Spark

If I have a 100 GB Avro dataset and the same dataset in ORC at 10 GB, does reading the ORC data in Spark consume less memory than reading the Avro dataset? I was thinking since all the data gets loaded into memory and deserialized, maybe…
Ryan · 1,102 · 1 · 15 · 30
1 vote · 1 answer

PySpark works in terminal but not when executed in Python code

I am trying to read an Avro file. The following is the sample data source that I found online to test my code: https://github.com/Teradata/kylo/blob/master/samples/sample-data/avro/userdata1.avro The following is my code (please assume…
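A common cause of "works in the terminal but not from a script" is that the interactive session was launched with `--packages` for spark-avro while the standalone script was not. A hedged config sketch — the package coordinates below assume a Spark 3.x / Scala 2.12 build and must be adjusted to your versions:

```python
# Config sketch (assumes Spark 3.x, Scala 2.12): request the spark-avro
# package from inside the script instead of on the command line.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-avro")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.0")
    .getOrCreate()
)
df = spark.read.format("avro").load("userdata1.avro")
df.printSchema()
```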
1 vote · 1 answer

Serializer for Avro Schema

I am new to Avro schemas. I created the following schema based on a reference JSON, but I am not able to create a serializer for it. { "name": "Name", "type": "record", "namespace": "NameSpace", "fields": [ { "name":…
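The question's schema is truncated, so the fields below are purely illustrative. As a sketch of what a serializer has to do, here is a toy JSON-based serializer that walks an Avro record schema, checks each declared field, and emits text (a real serializer would emit Avro binary instead):

```python
import json

# Hypothetical record schema in the same shape as the question's
schema = {
    "name": "Name", "type": "record", "namespace": "NameSpace",
    "fields": [
        {"name": "first", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

def serialize(datum, schema):
    """Toy JSON-based serializer: checks each declared field exists and
    has a vaguely compatible Python type, then emits JSON text."""
    primitive = {"string": str, "int": int, "long": int,
                 "float": float, "double": float, "boolean": bool}
    out = {}
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        if name not in datum:
            raise ValueError(f"missing field {name!r}")
        expected = primitive.get(ftype)
        if expected is not None and not isinstance(datum[name], expected):
            raise TypeError(f"field {name!r} should be {ftype}")
        out[name] = datum[name]
    return json.dumps(out)

print(serialize({"first": "Ada", "age": 36}, schema))
```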
1 vote · 1 answer

What FileOutputCommitter should be used when writing Avro files in Spark?

When saving an RDD to S3 in Avro format, I get the following warning in the console: "Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe." I haven't been able to find a simple implicit such as saveAsAvroFile and…
Mridang Agarwalla · 43,201 · 71 · 221 · 382
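For context, the settings commonly used to avoid the slow rename-based default committer on S3 can be passed as Spark conf entries. A sketch (availability of the S3A committers depends on the Hadoop version on the cluster):

```python
# Sketch: conf entries commonly used instead of the default
# rename-based FileOutputCommitter when writing to S3.
committer_conf = {
    # v2 commits task output directly, skipping the second rename pass
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    # on Hadoop 3.1+ with the S3A filesystem, a zero-rename committer
    "spark.hadoop.fs.s3a.committer.name": "magic",
}

for key, value in committer_conf.items():
    print(f"--conf {key}={value}")
```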
1 vote · 1 answer

org.apache.avro.UnresolvedUnionException: Not in union [{"type":"bytes","logicalType":"decimal","precision":18,"scale":4},"null"]: 0.0000

I am trying to read data stored in a Hive table in S3, convert it to Avro format, and then consume the Avro records to build the final object and push it to a Kafka topic. In the object I am trying to publish, I have a nested object that has fields…
user1868273 · 51 · 2 · 5
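This exception usually means a plain float or string was supplied where the union expects the bytes-backed decimal encoding. Per the Avro specification, a bytes decimal is the unscaled integer in big-endian two's-complement form. A stdlib sketch of that encoding:

```python
from decimal import Decimal

def decimal_to_avro_bytes(value: Decimal, scale: int) -> bytes:
    """Encode a decimal as Avro's bytes-backed decimal logical type:
    the unscaled integer in big-endian two's-complement form."""
    unscaled = int(value.scaleb(scale))  # e.g. 0.0000 * 10^4 -> 0
    length = max(1, (unscaled.bit_length() + 8) // 8)  # room for sign bit
    return unscaled.to_bytes(length, byteorder="big", signed=True)

print(decimal_to_avro_bytes(Decimal("0.0000"), 4))   # b'\x00'
print(decimal_to_avro_bytes(Decimal("12.3456"), 4))  # unscaled 123456
```

The value 0.0000 from the error message is therefore legal for the union — but only once encoded as bytes, not passed through as a raw number.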
1 vote · 1 answer

Spark - Wide/sparse dataframe persistence

I want to persist a very wide Spark DataFrame (>100'000 columns) that is sparsely populated (>99% of values are null) while keeping only non-null values (to avoid storage cost): What is the best format for such a use case (HBase, Avro, Parquet, ...)…
py-r · 419 · 5 · 15
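Whatever the file format, one common alternative for this shape of data is to melt the sparse wide table into (row, column, value) triples and store only non-null cells. A stdlib sketch of the transformation (column names here are illustrative):

```python
def melt_sparse(rows):
    """Convert wide, mostly-null rows into (row_id, column, value)
    triples, keeping only the non-null cells."""
    triples = []
    for row_id, row in enumerate(rows):
        for column, value in row.items():
            if value is not None:
                triples.append((row_id, column, value))
    return triples

wide = [
    {"c1": None, "c2": 5, "c3": None},
    {"c1": 1, "c2": None, "c3": None},
]
print(melt_sparse(wide))  # [(0, 'c2', 5), (1, 'c1', 1)]
```

The long form stores O(non-null cells) rather than O(rows × columns), which is where the savings come from at 99% sparsity.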
1 vote · 0 answers

Spark PySpark StructType StructField json to avro

I am trying to read a JSON file and write it to Avro. I used the PySpark StructType & StructField classes to programmatically specify the schema of the DataFrame. I am trying to write it to Avro format with logicalType set to …
losforword · 90 · 7
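The logicalType is truncated in the question; assuming it is a timestamp, the Avro timestamp-micros logical type stores microseconds since the Unix epoch in a long. A stdlib sketch of the conversion:

```python
from datetime import datetime, timezone

def to_timestamp_micros(dt: datetime) -> int:
    """Avro's timestamp-micros logical type: microseconds since the
    Unix epoch, carried in a long."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return round((dt - epoch).total_seconds() * 1_000_000)

ts = to_timestamp_micros(datetime(2021, 1, 1, tzinfo=timezone.utc))
print(ts)  # 1609459200000000
```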
1 vote · 1 answer

How does avro partition pruning work internally?

I have a daily job which converts Avro to Parquet. The Avro data is 20 GB per hour and is partitioned by year, month, day and hour. When I read the Avro files as below, spark.read.format("com.databricks.spark.avro").load(basePath).where($year=2020…
Gladiator · 354 · 3 · 19
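Conceptually, partition pruning works at the path level: with a `year=YYYY/month=MM/...` directory layout, a predicate on a partition column selects directories by name, so non-matching files are never opened. A stdlib sketch of that idea (the paths are made up):

```python
def prune(paths, **predicates):
    """Keep only paths whose key=value segments match every predicate,
    mimicking how a predicate on a partition column filters directories."""
    kept = []
    for path in paths:
        parts = dict(
            seg.split("=", 1) for seg in path.split("/") if "=" in seg
        )
        if all(parts.get(k) == str(v) for k, v in predicates.items()):
            kept.append(path)
    return kept

paths = [
    "data/year=2020/month=01/part-0.avro",
    "data/year=2019/month=12/part-0.avro",
]
print(prune(paths, year=2020))  # only the year=2020 file survives
```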
1 vote · 0 answers

Wrapping StructField to use it in from_avro()

I'm running a test where I create a DataFrame, encode a field using to_avro() and then decode it using from_avro(). The initial (unencoded) DataFrame's schema has the StructField instance for the field I encode in Avro. My goal is to use the…
Vlad.Bachurin · 1,340 · 1 · 14 · 22
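Since from_avro() takes the Avro schema as a JSON string, one way to reuse a StructField is to build that JSON by hand, wrapping the single field in a top-level record. A hedged stdlib sketch — the record name is an arbitrary illustrative choice, and whether this matches what to_avro() actually emitted depends on your Spark version:

```python
import json

def wrap_field(name, avro_type):
    """Build a JSON Avro schema that wraps a single field in a
    top-level record, in the full-schema form from_avro() expects.
    (The record name here is purely illustrative.)"""
    return json.dumps({
        "type": "record",
        "name": "topLevelRecord",
        "fields": [{"name": name, "type": avro_type}],
    })

schema_json = wrap_field("event_id", ["null", "string"])
print(schema_json)
```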
1 vote · 1 answer

Copy Avro schema of one DataFrame to another (PySpark)

I have dataset A with schema A and dataset B with schema B. Datasets A and B are mostly similar (they have the same columns, but the data types differ for a few), with only minor differences. One example being a column in dataset A has date…
chaithanya · 11 · 2
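A common way to align dataset B with dataset A's schema is to diff the two column-to-type mappings and cast only the columns that differ. A stdlib sketch of generating the cast expressions (the schemas and types below are made-up examples):

```python
# Sketch: compare two flat schemas (column -> type name) and emit the
# CAST expressions needed to align dataset B with dataset A's types.
def align_casts(schema_a, schema_b):
    casts = []
    for column, a_type in schema_a.items():
        b_type = schema_b.get(column)
        if b_type is not None and b_type != a_type:
            casts.append(f"CAST({column} AS {a_type.upper()}) AS {column}")
        else:
            casts.append(column)
    return casts

schema_a = {"id": "bigint", "created": "date"}
schema_b = {"id": "bigint", "created": "string"}
print(", ".join(align_casts(schema_a, schema_b)))
# id, CAST(created AS DATE) AS created
```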
1 vote · 1 answer

Can we automate spark SQL query generation from AVRO schema?

I'm working on a project where every day I need to deal with tons of Avro files. To extract the data from Avro I use Spark SQL. To achieve this, first I need to printSchema and then I need to select the fields to see the data. I want to automate this…
Teja · 31 · 7
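Since an Avro schema is itself JSON, generating the SELECT can be automated by walking the record's fields. A stdlib sketch that flattens nested records into dotted column paths (the example schema and table name are made up):

```python
import json

def select_from_avro(schema, table):
    """Walk an Avro record schema and emit a SELECT over its (possibly
    nested) fields, using dotted paths for nested records."""
    def paths(fields, prefix=""):
        for f in fields:
            t = f["type"]
            if isinstance(t, dict) and t.get("type") == "record":
                yield from paths(t["fields"], prefix + f["name"] + ".")
            else:
                yield prefix + f["name"]
    return f"SELECT {', '.join(paths(schema['fields']))} FROM {table}"

schema = json.loads("""{
  "type": "record", "name": "Event", "fields": [
    {"name": "id", "type": "long"},
    {"name": "user", "type": {"type": "record", "name": "User",
      "fields": [{"name": "name", "type": "string"}]}}
  ]}""")
print(select_from_avro(schema, "events"))
# SELECT id, user.name FROM events
```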
1 vote · 1 answer

How to append data in existing AVRO file using Python

I have a DataFrame with a similar schema, and I need to append the data to an existing Avro file. For your information, my Avro file is not inside a folder as a part file, and I don't want it written that way. Can you please help me to solve the…
1 vote · 1 answer

Avro - Add doc/description for each field

We are using Avro for our schema definition. Is it possible to add a description for each of the fields in Avro? I agree that we can add 'doc' at the record level; we want to add a description at the field level.
SunilS · 2,030 · 5 · 34 · 62
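The Avro specification does allow a "doc" attribute on individual fields, not just on the record. A small illustrative schema (field names are made up):

```python
import json

# Avro allows "doc" both at the record level and on each field:
schema = {
    "type": "record",
    "name": "Customer",
    "doc": "Record-level description",
    "fields": [
        {"name": "id", "type": "long", "doc": "Surrogate key"},
        {"name": "email", "type": "string", "doc": "Primary contact"},
    ],
}
print(json.dumps(schema, indent=2))
```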
1 vote · 1 answer

Spark from_avro function with schema registry support

I am trying to use the Confluent Schema Registry with Spark's from_avro function as per this doc. I have the below imports: "io.confluent" % "kafka-schema-registry-client" % "5.4.1", "io.confluent" % "kafka-avro-serializer" % "5.4.1", "org.apache.spark"…
irrelevantUser · 1,172 · 18 · 35
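One detail that trips up this combination: messages serialized by the Confluent serializer are prefixed with a magic byte (0) and a 4-byte big-endian schema id, while Spark's from_avro expects plain Avro binary, so the 5-byte header must be stripped first. A stdlib sketch of the wire format:

```python
import struct

def strip_confluent_header(payload: bytes):
    """Split off Confluent's wire-format header: a magic byte (0)
    followed by a 4-byte big-endian schema id, then the Avro body."""
    magic, schema_id = struct.unpack(">bI", payload[:5])
    if magic != 0:
        raise ValueError("not Confluent-framed data")
    return schema_id, payload[5:]

# Example frame: magic 0, schema id 42, then the (fake) Avro body
frame = b"\x00\x00\x00\x00\x2a" + b"avro-body"
schema_id, body = strip_confluent_header(frame)
print(schema_id, body)  # 42 b'avro-body'
```

In Spark the same stripping is typically done with substring/slice on the Kafka value column before passing it to from_avro.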