I am using Structured Streaming (Spark 2.4.0) to read Avro messages from Kafka, and
using the Confluent Schema Registry to fetch/read the schema.
I am unable to access the deeply nested fields.
The schema looks like this in compacted .avsc…
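One known gotcha in this setup (a sketch, assuming the Confluent wire format is in play): the Schema Registry serializer prefixes each message value with one magic byte and a 4-byte schema id, which spark-avro's plain `from_avro` does not expect. Stripping that header at the byte level might look like this; the function name is ours, not from any library:

```scala
// Confluent's wire format: [magic byte 0x0][4-byte schema id][avro payload].
// Plain from_avro expects only the payload, so drop the 5-byte prefix first.
def stripConfluentHeader(value: Array[Byte]): Array[Byte] = {
  require(value.length > 5 && value(0) == 0x0, "not a Confluent-framed message")
  value.drop(5)
}
```

After stripping, `from_avro(col, schemaJson)` can decode the payload, and nested fields are then reachable with dot paths such as `select("data.a.b.c")` (names hypothetical).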
I am trying to build Spark from the master branch with ./build/sbt clean package.
I want to test something specific to the spark-avro submodule. However, when I run ./bin/spark-shell and try:
scala> import org.apache.spark.sql.avro._
I receive object avro…
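For context: in the Spark repo, spark-avro lives under `external/avro` and is not on the default `spark-shell` classpath after a plain `package` build, so it has to be supplied explicitly. One way to do that (the jar path below is illustrative; check `external/avro/target` for the actual file name on your branch):

```shell
# Build Spark including the external modules, then hand the avro jar to spark-shell.
# <version> is a placeholder; the real jar name depends on your branch and Scala version.
./build/sbt clean package
./bin/spark-shell --jars external/avro/target/scala-2.12/spark-avro_2.12-<version>-SNAPSHOT.jar
```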
Any file write attempt of Avro format fails with the stack trace below.
We are using Spark 2.4.3 (with user-provided Hadoop) and Scala 2.12, and we load the Avro package at runtime with either spark-shell:
spark-shell --packages…
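If the `--packages` coordinate is part of the problem, note that the artifact's Scala suffix must match the Scala version Spark was built with; for Spark 2.4.3 on Scala 2.12 that would be, for example:

```shell
# The _2.12 suffix must match the Scala version of the Spark build;
# a _2.11 artifact on a 2.12 build fails at runtime.
spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.3
```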
I am reading an Avro file from Azure Data Lake using Databricks. I use this path to read the current date's file for the daily run; the code that derives the file date looks like this, and it gets the current date fine.
val pfdtm =…
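The snippet above is cut off; a minimal self-contained sketch of deriving a current-date path looks like the following (the folder layout and host name are hypothetical, not the asker's actual path):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical sketch: derive the current-date segment of a daily-run path.
val pfdtm = LocalDate.now.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))
val path  = s"adl://mylake.azuredatalakestore.net/raw/daily/$pfdtm/data.avro"
```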
I have created two tables in Hive:
CREATE external TABLE avro1(id INT,name VARCHAR(64),dept VARCHAR(64)) PARTITIONED BY (yoj VARCHAR(64)) STORED AS avro;
CREATE external TABLE avro2(id INT,name VARCHAR(64),dept VARCHAR(64)) PARTITIONED BY (yoj…
I've got data in Avro format partitioned by date and time, and I receive new data every hour. Newer partitions can contain more columns than older ones. When I read it with Spark 2.4.3 I get a DataFrame with the schema of the first (oldest) partition and…
I'm converting .avro files to JSON format, then parsing specific data items to be indexed in my Elasticsearch cluster. Each chunk contains roughly 1.8 gigabytes of data and there are about 500 chunks. It doesn't take long to run out of memory, but…
I'm reading an .avro file where the data of a particular column is in binary format. I'm currently converting the binary to string with a UDF for readability, and then I will need to convert it into JSON format…
When I query an .avro file in Apache Drill, I get the Body column values correctly, as shown in the snapshot below. But if I do the same in Spark SQL, the Body column values come back in binary format. Is there a way I can read the data correctly…
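In Spark SQL, a binary column like this usually just needs an explicit cast to string (assuming the bytes are UTF-8 text), e.g. `df.selectExpr("CAST(Body AS STRING)")`. At the JVM level that cast is equivalent to the following plain-Scala decode, shown here with a hypothetical payload:

```scala
import java.nio.charset.StandardCharsets

// CAST(binary AS STRING) in Spark decodes the bytes as UTF-8,
// i.e. the same as new String(bytes, UTF_8) on a raw byte array.
val body: Array[Byte] = """{"event":"click"}""".getBytes(StandardCharsets.UTF_8)
val readable: String  = new String(body, StandardCharsets.UTF_8)
```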
I have a requirement to mask the data in some of the fields of a given schema. I've researched a lot and couldn't find the answer I need.
This is the schema where I need some changes to the fields (answer_type, response0,…
The project I am working on receives data in the form of Avro files. I am writing generic code in Scala (using Scala IDE) to read all Avro files present in a folder and create a table for each of the Avro files.
I am reading the Avro file data as a…
I am trying to read an Avro file in a Jupyter notebook using PySpark. When I read the file, I get an error.
I have downloaded spark-avro_2.11:4.0.0.jar, but I am not sure where in my code I should insert the Avro package. Any suggestions…
I have an Avro file output from a Spark job with some objects in it:
Objavro.schema�{"type":"record","name":"topLevelRecord","fields":
[{"name":"Name","type":["String","null"]},{"name":"Age","type":
["int","null"]}]}
Is there a way to get…
I am trying to query HDFS, which has a lot of Avro part files. Recently we made a change to reduce parallelism, and thus the size of the part files has increased; each of these part files is in the range of 750 MB to 2 GB (we use Spark…