Questions tagged [spark-avro]

A library for reading and writing Avro data from Spark SQL.

The GitHub page is here.

227 questions
5
votes
1 answer

How to save complex JSON or complex objects as Parquet in Spark?

I'm new to Spark and I'm trying to figure out whether there is a way to save complex (nested) objects or complex JSON as Parquet in Spark. I'm aware of the Kite SDK, but I understand it uses MapReduce. I looked around but I was unable to find a…
IceMan
  • 1,398
  • 16
  • 35
5
votes
2 answers

How to write Avro to multiple output directories using Spark

Hi, there is a topic about writing text data to multiple output directories in one Spark job using MultipleTextOutputFormat: "Write to multiple outputs by key Spark - one Spark job". Is there a similar way to write Avro data to…
Tom
  • 5,848
  • 12
  • 44
  • 104
5
votes
1 answer

Skipping fields in a record using spark-avro

Update: the spark-avro package was updated to support this scenario: https://github.com/databricks/spark-avro/releases/tag/v3.1.0 I have an Avro file that was created by a third party outside my control, which I need to process using Spark. The Avro…
itaysk
  • 5,852
  • 2
  • 33
  • 40
4
votes
3 answers

Can I write multiple DataFrames in parallel in Spark?

I have a question: I want to sequentially write many DataFrames in Avro format, and I use the code below in a for loop: df.repartition().write.mode().avro() The problem is that when I run my Spark job…
4
votes
1 answer

How to call Avro SchemaConverters in PySpark

Although PySpark has Avro support, it does not have the SchemaConverters method. I may be able to use Py4J to accomplish this, but I have never used a Java package within Python. This is the code I am using: # Import SparkSession from pyspark.sql…
H. Trujillo
  • 427
  • 1
  • 8
  • 21
4
votes
1 answer

java.lang.NoSuchMethodError when reading an Avro file using PySpark

I'm trying to load an Avro file using PySpark running on a Dataproc job: spark_session.read.format("avro").load("/path/to/avro") I'm getting the following error: File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in…
4
votes
2 answers

Spark DataFrame: How to specify schema when writing as Avro

I want to write a DataFrame in Avro format using a provided Avro schema rather than Spark's auto-generated schema. How can I tell Spark to use my custom schema on write?
erwaman
  • 3,307
  • 3
  • 28
  • 29
4
votes
2 answers

Map Avro files to a Java class with different field names

I've got a problem with a simple Spark task that reads an Avro file and then saves it as a Hive Parquet table. I've got 2 types of files. In general they are the same, but the key struct differs slightly in its field names. Type 1: root |-- pk: strucnt…
Danila Zharenkov
  • 1,720
  • 1
  • 15
  • 27
4
votes
0 answers

Streaming Avro files from a directory

I'm trying to set up a structured stream from a directory of Avro files. We already have some non-streaming code to deal with exactly the same data, so the least-effort step forward to streaming would be to re-use that code. To move to…
4
votes
2 answers

Handling schema changes in running Spark Streaming application

I am looking to build a Spark Streaming application using the DataFrames API on Spark 1.6. Before I get too far down the rabbit hole, I was hoping someone could help me understand how DataFrames deal with data having a different schema. The idea…
Ben
  • 1,793
  • 2
  • 15
  • 22
4
votes
2 answers

Installing spark-avro

I'm trying to read Avro files in PySpark. I found out from "How to read Avro file in PySpark" that spark-avro is the best way to do that, but I can't figure out how to install it from their GitHub repo. There's no downloadable jar, do I build it…
noobman
  • 75
  • 1
  • 7
3
votes
1 answer

Why is adding the org.apache.spark.avro dependency mandatory to read/write Avro files in Spark 2.4 when I'm using com.databricks.spark.avro?

I tried to run my Spark 2.3.0 Scala code on a Cloud Dataproc 1.4 cluster, which has Spark 2.4.8 installed. I got an error when reading Avro files. Here's my code…
3
votes
1 answer

How to write a Spark DataFrame to a single file on the local file system without using coalesce

I want to generate an Avro file from a PySpark DataFrame, and currently I am using coalesce as below: df = df.coalesce(1) df.write.format('avro').save('file:///mypath') But this is now leading to memory issues, as all the data will be fetched to…
newbie
  • 1,282
  • 3
  • 20
  • 43
3
votes
1 answer

Schema evolution comparison: Apache Avro vs. Apache Parquet

I would like to cross-check my understanding of the differences between file formats like Apache Avro and Apache Parquet in terms of schema evolution. Looking at various blogs and SO answers gives me the following understanding. I need to verify if my…
3
votes
1 answer

Hive external table on Avro timestamp field returns long

I have Avro data with a single timestamp column, and I am trying to create an external Hive table on top of the Avro files. The data gets saved in Avro as a long, and I expect the Avro logical type to handle the conversion back to a timestamp…
Ajith Kannan
  • 812
  • 1
  • 8
  • 30