Questions tagged [spark-avro]

A library for reading and writing Avro data from Spark SQL.

227 questions
3
votes
1 answer

Read Avro in Azure HDI4.0

I'm trying to read an Avro file using a Jupyter notebook in Azure HDInsight 4.0 with Spark 2.4. I'm not able to properly provide the .jar file. I've tried the approach suggested in "How to use Avro on HDInsight Spark/Jupyter?" and in…
MDP89
  • 306
  • 1
  • 9
3
votes
1 answer

Can not read avro in DataProc Spark with spark-avro

I have a cluster on Google DataProc (with image 1.4) and I want to read Avro files with Spark from Google Cloud Storage. I followed this guide: Spark read avro. The command I ran is: gcloud dataproc jobs submit pyspark test.py \ --cluster…
user2830451
  • 2,126
  • 5
  • 25
  • 31
3
votes
2 answers

Spark Read multiple paths with automatic partitions discovery

I'm trying to read some Avro files into a DataFrame from multiple paths. Let's say my path is "s3a://bucket_name/path/to/file/year=18/month=11/day=01". Under this path I have two more partitions, let's say country=XX/region=XX. I want to read multiple…
R.Peretz
  • 71
  • 2
  • 10
3
votes
0 answers

Spark avro predicate pushdown

We are using the Avro data format, and the data is partitioned by year, month, day, hour, and min. I see the data stored in HDFS as /data/year=2018/month=01/day=01/hour=01/min=00/events.avro, and we load the data using val schema = new…
Vijay Muvva
  • 1,063
  • 1
  • 17
  • 31
3
votes
2 answers

Spark sql saveAsTable create table append mode if new column is added in avro schema

I am using a Spark SQL Dataset to write data into Hive. It works perfectly if the schema is the same, but if I change the Avro schema by adding a new column in between, it shows an error (the schema is provided from the schema registry): Error running job streaming…
Sumit G
  • 436
  • 8
  • 21
3
votes
1 answer

Setting Values in nested field in Avro Schema

I am trying to produce Avro data into Kafka using GenericData.Record, but I am getting the following exception: Exception in thread "main" org.apache.avro.AvroRuntimeException: Not a valid schema field: emailAddresses.email Here is my schema: { …
Sumit G
  • 436
  • 8
  • 21
3
votes
0 answers

Reading Avro messages from Kafka using Structured Streaming in Spark 2.1

I followed @Ralph Gonzalez's message on this thread reading Avro messages from Kafka using Structured Streaming in Spark 2.1, but am getting the following error. org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -40 at…
3
votes
2 answers

NoSuchMethodError using Databricks Spark-Avro 3.2.0

I have a spark master & worker running in Docker containers with spark 2.0.2 and hadoop 2.7. I'm trying to submit a job from pyspark from a different container (same network) by running df =…
arinarmo
  • 375
  • 1
  • 11
3
votes
1 answer

How to convert parquet file to Avro file?

I am new to Hadoop and big data technologies. I would like to convert a Parquet file to an Avro file and read that data. I searched a few forums, which suggested using AvroParquetReader. AvroParquetReader reader = new…
PrinceChamp
  • 41
  • 1
  • 3
3
votes
0 answers

Complex json log data transformation using?

I am new to data science tools and have a use case: transforming JSON logs into flattened columnar data, which could be treated as normal CSV. I looked into a lot of alternative tools for this problem and found that I can easily solve this…
fireants
  • 191
  • 1
  • 11
2
votes
0 answers

Pyspark + Avro type conversion problems after transformation

I use Structured Streaming to read Avro records from a Kafka topic A, do some transformations and write as Avro to another Kafka topic B. I use those functions for serializing and deserializing the Avro records. I faced another exception (parsing…
JayKay
  • 152
  • 11
2
votes
0 answers

"Failed to find data source: avro" exception while writing Spark Dataframe to redshift

I am following the community URL https://github.com/spark-redshift-community/spark-redshift#python to connect to Redshift, and it seems to use Avro dependencies although I am not using Avro as the input source data format. My Scala is 2.12 and dependencies…
2
votes
0 answers

Circular Reference in Bean Class While Creating a Dataset from an Avro Generated Class

I have a class RawSpan.java that is Avro-generated from the corresponding avdl definition. I am trying to use this class to convert a DataFrame to a Dataset in Spark as: val ds = df.select("value").select(from_avro($"value", "topic",…
Prashant Pandey
  • 4,332
  • 3
  • 26
  • 44
2
votes
0 answers

Databricks: Provide schema in dataframe column as a parameter for from_avro

I'm trying to use the function from_avro on a dataframe. This dataframe originates from a streamRead from Kafka, and at some point I create a column with the schemaId (related to the schema registry) and the message. I then have a UDF that grabs the…
FEST
  • 813
  • 2
  • 14
  • 37
2
votes
1 answer

Pyspark writing dataframe to avro maintaining the sequence of key values

I am trying to read an Avro file using PySpark and sort one of the columns based on certain keys. One of the columns in my Avro file contains MapType data, which I need to sort based on keys. The test Avro file contains only one row with the entities…
ArinCool
  • 1,720
  • 1
  • 13
  • 24