Questions tagged [spark-avro]

A library for reading and writing Avro data from Spark SQL.

The GitHub page is here.

227 questions
0
votes
1 answer

Unable to access deserialized nested Avro generic record elements in Scala

I am using Structured Streaming (Spark 2.4.0) to read Avro messages through Kafka, using the Confluent Schema Registry to receive/read the schema, but I am unable to access the deeply nested fields. The schema looks like this in compacted avsc…
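A frequent cause of this kind of failure with Confluent-produced topics: the Schema Registry serializer prepends a 5-byte header (one magic byte `0x00` plus a 4-byte big-endian schema ID) to every message, and Spark's `from_avro` expects raw Avro bytes. A minimal sketch of stripping that header, in Python for illustration (the question itself is Scala; the byte layout is the same either way):

```python
import struct

def strip_confluent_header(payload: bytes) -> tuple[int, bytes]:
    """Split a Confluent-framed Kafka message into (schema_id, raw_avro_bytes).

    Confluent Schema Registry serializers prepend one magic byte (0x00)
    and a 4-byte big-endian schema ID before the Avro body.
    """
    if len(payload) < 5 or payload[0] != 0:
        raise ValueError("not a Confluent-framed Avro message")
    schema_id = struct.unpack(">I", payload[1:5])[0]
    return schema_id, payload[5:]

# Example: schema ID 42 followed by two Avro body bytes.
framed = b"\x00" + struct.pack(">I", 42) + b"\x02\x06"
schema_id, body = strip_confluent_header(framed)
print(schema_id, body)  # 42 b'\x02\x06'
```

In Spark the equivalent is slicing the Kafka `value` column to drop the first five bytes before handing it to `from_avro`.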
0
votes
1 answer

Issue building Apache Spark with Avro

I am trying to build Spark from the master branch with ./build/sbt clean package. I want to test something specific to the spark-avro submodule. However, when I run ./bin/spark-shell and try: scala> import org.apache.spark.sql.avro._ I receive object avro…
irrelevantUser
  • 1,172
  • 18
  • 35
0
votes
4 answers

Spark Avro throws exception on file write: NoSuchMethodError

Any attempt to write a file in Avro format fails with the stack trace below. We are using Spark 2.4.3 (with user-provided Hadoop) and Scala 2.12, and we load the Avro package at runtime with either spark-shell: spark-shell --packages…
0
votes
0 answers

How to use a file date automatically in Scala?

I am reading an Avro file from Azure Data Lake using Databricks, and I am using this path to read the current date's file for a daily run. The code to derive the file date looks like this, and it gets the current date fine. val pfdtm =…
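The usual pattern for a daily run is to format the run date into the path rather than hard-coding it. A minimal sketch in Python for illustration (the question is Scala/Databricks; the base path and layout here are hypothetical):

```python
from datetime import date

def daily_avro_path(base: str, run_date: date) -> str:
    """Build a date-partitioned glob like <base>/2019/07/15/*.avro."""
    return f"{base}/{run_date:%Y/%m/%d}/*.avro"

print(daily_avro_path("adl://lake/raw/events", date(2019, 7, 15)))
# adl://lake/raw/events/2019/07/15/*.avro
```

In Scala the same idea is a `java.time.LocalDate` formatted with a `DateTimeFormatter` and interpolated into the read path.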
HaiY
  • 145
  • 1
  • 5
  • 15
0
votes
1 answer

Getting exception reading from avro table using Spark or Hive console - Failed to obtain maxLength value for varchar field from file schema: "string"

I have created two tables in Hive: CREATE external TABLE avro1(id INT, name VARCHAR(64), dept VARCHAR(64)) PARTITIONED BY (yoj VARCHAR(64)) STORED AS avro; CREATE external TABLE avro2(id INT, name VARCHAR(64), dept VARCHAR(64)) PARTITIONED BY (yoj…
0
votes
1 answer

How to read all columns from Avro when newer partitions have more columns than older ones?

I've got data in Avro format, partitioned by date and time, and I receive new data every hour. Newer partitions can contain more columns than older ones. When I read it with Spark 2.4.3 I get a DataFrame with the schema of the first (oldest) partition and…
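Spark's Avro source infers the schema from one file it lists, so columns added in newer partitions are silently dropped. A common workaround is to pass a superset schema explicitly via the `avroSchema` read option. A hedged pure-Python sketch of computing that superset from parsed `.avsc` record schemas (the helper name is hypothetical; new fields are made nullable so old files still decode):

```python
import json

def union_record_fields(*schemas: dict) -> dict:
    """Merge Avro record schemas by taking the union of their fields.

    Fields that only appear in later schemas are wrapped as
    ["null", type] with a null default so older files still decode.
    """
    merged, seen = dict(schemas[0], fields=[]), set()
    for schema in schemas:
        for field in schema["fields"]:
            if field["name"] in seen:
                continue
            seen.add(field["name"])
            if schema is schemas[0]:
                merged["fields"].append(field)
            else:  # field only present in a newer partition: make it optional
                merged["fields"].append({"name": field["name"],
                                         "type": ["null", field["type"]],
                                         "default": None})
    return merged

old = {"type": "record", "name": "Event",
       "fields": [{"name": "id", "type": "long"}]}
new = {"type": "record", "name": "Event",
       "fields": [{"name": "id", "type": "long"},
                  {"name": "source", "type": "string"}]}
print(json.dumps(union_record_fields(old, new)["fields"]))
```

The resulting schema JSON would then be handed to Spark as `spark.read.format("avro").option("avroSchema", schema_json)`.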
0
votes
0 answers

Spark SQL error when reading data from an Avro table

When I try reading data from an Avro table using spark-sql, I get this error. Caused by: java.lang.NullPointerException at…
Srinivas
  • 2,010
  • 7
  • 26
  • 51
0
votes
0 answers

Why am I running out of memory when adding bulk documents to Elasticsearch using bulk helpers?

I'm converting .avro files to JSON format, then parsing specific data items to be indexed on my Elasticsearch cluster. Each chunk contains roughly 1.8 gigabytes of data and there are about 500 chunks. It doesn't take long to run out of memory, but…
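A common cause of bulk-indexing OOMs is building the full list of action dicts in memory before calling the helper. Generators keep only one batch alive at a time; a minimal sketch (pure Python, with the Elasticsearch client calls omitted and index name hypothetical):

```python
def actions(records, index):
    """Lazily yield one bulk action per record instead of building a list."""
    for rec in records:
        yield {"_index": index, "_source": rec}

def chunked(iterable, size):
    """Group an iterator into lists of at most `size` items."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

batches = list(chunked(actions(({"n": i} for i in range(5)), "docs"), 2))
print([len(b) for b in batches])  # [2, 2, 1]
```

With elasticsearch-py you would pass the `actions(...)` generator directly to `helpers.streaming_bulk` (or `helpers.bulk` with a `chunk_size`) rather than materializing a list first.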
Remixt
  • 597
  • 6
  • 28
0
votes
1 answer

Convert Array[Byte] to JSON format using Spark Scala

I'm reading an .avro file where the data of a particular column is in binary format. I'm currently converting the binary format to string format with the help of a UDF for readability, and then finally I will need to convert it into JSON format…
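If the bytes are UTF-8 JSON text, the conversion is a decode followed by a parse. A minimal sketch in Python for illustration (the question is Scala, where the same two steps would live inside the UDF; the payload here is invented):

```python
import json

def bytes_to_json(raw: bytes) -> dict:
    """Decode a UTF-8 byte payload and parse it as a JSON object."""
    return json.loads(raw.decode("utf-8"))

payload = b'{"id": 7, "name": "sensor-a"}'
print(bytes_to_json(payload)["name"])  # sensor-a
```

In Spark, once the column is a JSON string, `from_json` with an explicit schema is usually preferable to a hand-rolled UDF.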
Anil Kumar
  • 525
  • 6
  • 27
0
votes
0 answers

Spark is unable to read the Avro file format

When I query an .avro file in Apache Drill, I get the Body column values correctly, as shown in the snapshot below. But if I do the same in Spark SQL, the Body column values come back in binary format. Is there a way I can read the data correctly…
Anil Kumar
  • 525
  • 6
  • 27
0
votes
1 answer

How to assign constant values to the nested objects in pyspark?

I have a requirement where I need to mask the data for some of the fields in a given schema. I've researched a lot and couldn't find the answer that is needed. This is the schema where I need some changes on the fields (answer_type, response0,…
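The general idea is to walk the nested structure and replace the targeted fields with a constant wherever they occur. A hedged sketch on plain Python dicts (in PySpark the equivalent is rebuilding the struct with `withColumn`, `struct`, and `lit`; the record below is invented around the field names mentioned in the question):

```python
def mask_fields(obj, targets, constant="***"):
    """Recursively replace the value of any key in `targets` with `constant`."""
    if isinstance(obj, dict):
        return {k: (constant if k in targets else mask_fields(v, targets, constant))
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [mask_fields(v, targets, constant) for v in obj]
    return obj

record = {"answer_type": "text",
          "responses": [{"response0": "secret", "score": 3}]}
print(mask_fields(record, {"answer_type", "response0"}))
# {'answer_type': '***', 'responses': [{'response0': '***', 'score': 3}]}
```

The same recursion over a DataFrame schema's `StructType`/`ArrayType` is how you would generate the masked column expressions in PySpark.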
0
votes
0 answers

Read data from an Avro file and write to an Impala table

The project I am working on receives data in the form of Avro files. I am writing generic code in Scala (using Scala IDE) to read all Avro files present in a folder and create a table for each of the Avro files. I am reading the Avro file data as a…
0
votes
1 answer

Trouble reading avro files in Jupyter notebook using pyspark

I am trying to read an Avro file in a Jupyter notebook using PySpark. When I read the file I am getting an error. I have downloaded spark-avro_2.11:4.0.0.jar; I am not sure where in my code I should be inserting the Avro package. Any suggestions…
Conz
  • 3
  • 2
0
votes
0 answers

Get object size from File

I have an avro file outputted from a spark job with some objects in it: Objavro.schema�{"type":"record","name":"topLevelRecord","fields": [{"name":"Name","type":["String","null"]},{"name":"Age","type": ["int","null"]}]} Is there a way to get…
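The snippet quoted above is the Avro object container header; after the header and sync marker, each data block starts with two zigzag-varint longs: the object count and the serialized size in bytes. So per-block counts and sizes can be read without decoding any records. A hedged sketch of the varint decoder:

```python
def read_zigzag_long(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one Avro zigzag varint starting at `pos`; return (value, next_pos).

    Avro longs are zigzag-encoded, then written little-endian in
    7-bit groups with the high bit as a continuation flag.
    """
    shift = result = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos

# A block header claiming 3 objects in 100 bytes:
# 3 -> zigzag 6 -> b'\x06'; 100 -> zigzag 200 -> b'\xc8\x01'
header = b"\x06\xc8\x01"
count, pos = read_zigzag_long(header)
size, _ = read_zigzag_long(header, pos)
print(count, size)  # 3 100
```

In practice a library such as fastavro or avro-python3 exposes the same block metadata without hand-parsing, but the layout above is what is actually on disk.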
ThatComputerGuy
  • 323
  • 3
  • 6
  • 11
0
votes
2 answers

Does the size of part files play a role in Spark SQL performance?

I am trying to query HDFS, which has a lot of part files (Avro). Recently we made a change to reduce parallelism, and thus the size of the part files has increased; each of these part files is in the range of 750 MB to 2 GB (we use Spark…
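For splittable formats like Avro, large part files do not force one task per file: Spark cuts input files into splits of at most `spark.sql.files.maxPartitionBytes` (default 128 MB). A rough back-of-the-envelope sketch, hedged in that it ignores `spark.sql.files.openCostInBytes` and bin-packing of small files:

```python
def estimated_tasks(file_sizes_bytes, max_partition_bytes=128 * 1024 * 1024):
    """Rough input-task count: each file yields ceil(size / maxPartitionBytes) splits."""
    return sum(-(-size // max_partition_bytes) for size in file_sizes_bytes)

GB, MB = 1024 ** 3, 1024 ** 2
# One 2 GB part file and one 750 MB part file at the 128 MB default:
print(estimated_tasks([2 * GB, 750 * MB]))  # 16 + 6 = 22
```

So the larger part files mainly shift the work into more splits per file; the bigger performance concern is usually whether each split still lands near its HDFS blocks.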