Questions tagged [spark-avro]

A library for reading and writing Avro data from Spark SQL.

The GitHub page is at https://github.com/databricks/spark-avro.

227 questions
2 votes · 1 answer

How to read Avro schema from empty RDD?

I'm using the AvroKeyInputFormat to read Avro files: val records = sc.newAPIHadoopFile[AvroKey[T], NullWritable, AvroKeyInputFormat[T]](path).map(_._1.datum()) Because I need to reflect over the schema in my job, I get the Avro schema like…
Lukas Wegmann · 468
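
A common workaround, sketched below, is to read the writer schema straight from the header of one of the underlying files instead of from the (possibly empty) RDD; the path is a placeholder.

    import org.apache.avro.file.DataFileReader
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.mapred.FsInput
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Open one .avro file directly and read the writer schema from its
    // header, so an empty RDD is not a problem (path is a placeholder).
    val input = new FsInput(new Path("hdfs:///data/part-00000.avro"), new Configuration())
    val reader = new DataFileReader[GenericRecord](input, new GenericDatumReader[GenericRecord]())
    val schema = reader.getSchema
    reader.close()
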
2 votes · 1 answer

How to manually load spark-redshift AVRO files into Redshift?

I have a Spark job that failed at the COPY portion of the write. I have all the output already processed in S3, but am having trouble figuring out how to manually load it. COPY table FROM…
flybonzai · 3,763
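
One hedged way to finish the load by hand, assuming the Avro files are already in S3 and the Redshift JDBC driver is on the classpath, is to issue the COPY yourself; every identifier and credential below is a placeholder.

    import java.sql.DriverManager

    // Run the COPY that spark-redshift would have issued, pointing at the
    // Avro output already sitting in S3 (all values are placeholders).
    val conn = DriverManager.getConnection(
      "jdbc:redshift://example-cluster:5439/mydb", "user", "password")
    try {
      conn.createStatement().execute(
        """COPY my_table
          |FROM 's3://my-bucket/spark-output/'
          |CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-role'
          |FORMAT AS AVRO 'auto'""".stripMargin)
    } finally {
      conn.close()
    }
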
2 votes · 2 answers

Non-HBase solution for storing huge data and updating in real time

Hi, I have developed an application where I have to store TBs of data for the first time, and then a 20 GB monthly incremental (inserts/updates/deletes in the form of XML) that will be applied on top of this 5 TB of data. And finally, on a request basis, I…
Sudarshan kumar · 1,503
2 votes · 1 answer

Spark 1.6: load a specific partition in a DataFrame, keeping the partition field

We have an Avro dataset partitioned like this: table/a=01, table/a=02. We want to load the data from a single partition, keeping the partition column a. I found this Stack Overflow question and applied the suggested snippet: DataFrame df =…
Stefano Lazzaro · 387
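
The usual trick for this, sketched below under the assumption that spark-avro's DataFrame reader is in use, is to set the basePath option to the table root so partition discovery keeps the column a even though only one partition directory is loaded; paths are illustrative.

    // Load a single partition but keep `a` as a column by pointing
    // basePath at the table root (paths are illustrative).
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .option("basePath", "/data/table")
      .load("/data/table/a=01")
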
2 votes · 1 answer

Explode Spark Dataframe Avro Map into flat format

I am using Spark Shell v_1.6.1.5. I have the following Spark Scala Dataframe:

    val data = sqlContext.read.avro("/my/location/*.avro")
    data.printSchema
    root
     |-- id: long (nullable = true)
     |-- stuff: map (nullable = true)
     |    |-- key: string
     |…
Marsellus Wallace · 17,991
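
For a map column like stuff, explode emits one row per map entry with separate key and value columns, which flattens the frame; a minimal sketch against the data frame above:

    import org.apache.spark.sql.functions.explode

    // One output row per map entry; the entry's key and value become
    // columns alongside id.
    val flat = data.select(data("id"), explode(data("stuff")))
    flat.printSchema  // root: id, key, value
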
2 votes · 1 answer

Amazon EMR and S3, org.apache.spark.sql.AnalysisException: path s3://..../var/table already exists

I'm trying to find the source of a bug on Spark 2.0.0. I have a map that holds table names as keys and the dataframes as values; I loop through it and at the end use spark-avro (3.0.0-preview2) to write everything to S3 directories. It runs…
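
If the exception comes from a later iteration writing to a path an earlier one already created, one plausible fix is to make the save mode explicit; the bucket and the tableName variable below are placeholders.

    // Explicit SaveMode so an existing S3 prefix does not abort the write
    // (bucket and tableName are placeholders).
    df.write
      .mode("overwrite")
      .format("com.databricks.spark.avro")
      .save(s"s3://my-bucket/var/$tableName")
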
2 votes · 1 answer

Spark - Avro Reads Schema but DataFrame Empty

I am using Gobblin to periodically extract relational data from Oracle, convert it to Avro, and publish it to HDFS. My dfs directory structure looks like this:

    -tables
      -t1
        -2016080712345
          -f1.avro
        -2016070714345
    …
Brian · 7,098
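
Spark's Avro reader does not recurse into nested directories on its own, so one hedged explanation for a schema-but-no-rows result here is that the load path stops above the per-run subdirectories; globbing down to the files, as below, is a common workaround (paths are illustrative).

    // Glob through the per-run subdirectories down to the actual files;
    // a bare load of the table directory would not descend into them.
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/tables/t1/*/*.avro")
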
2 votes · 2 answers

How to read/parse *only* the JSON schema from a file containing an Avro message in binary format?

I have an Avro message in binary format in a…
TakeSoUp · 7,457
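
If the bytes are an Avro container file (rather than a bare datum), the writer schema sits in the file header and can be pulled out as JSON without decoding a single record; a minimal sketch, with the file name a placeholder:

    import java.io.FileInputStream
    import org.apache.avro.file.DataFileStream
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

    // Read only the header: getSchema returns the writer schema without
    // deserializing any records (file name is a placeholder).
    val stream = new DataFileStream[GenericRecord](
      new FileInputStream("message.avro"), new GenericDatumReader[GenericRecord]())
    val schemaJson = stream.getSchema.toString(true) // pretty-printed JSON
    stream.close()
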
1 vote · 0 answers

How to convert a spark dataframe schema to an avro schema using pyspark

Is there a pyspark function which could convert the below _schema variable to an Avro schema? df_schema = spark.read.format('parquet').load(input_directory) _schema = df_schema.schema
Fabrice Jammes · 2,275
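
As far as I know there is no public pyspark equivalent, but the conversion exists on the Scala side of the built-in spark-avro module (Spark 2.4+) as SchemaConverters.toAvroType; a sketch of that call, with df standing in for the loaded DataFrame:

    import org.apache.spark.sql.avro.SchemaConverters

    // Convert a Catalyst StructType into an org.apache.avro.Schema.
    val avroSchema = SchemaConverters.toAvroType(df.schema)
    println(avroSchema.toString(true)) // the schema as JSON
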
1 vote · 0 answers

Cannot read Avro file which contains whitespace in column names using Spark-Avro 2.12:3.2.0

After migrating to Spark 3.2.0 I had to upgrade the external spark-avro package to spark-avro 2.12:3.2.0. After this migration I was unable to read any Avro file that contains spaces in its column names. The error occurs on the read method…
1 vote · 0 answers

How to create pyspark session in Jupyter Notebook (under Dataproc Cluster) with avro datasource extension?

Via Concord, we can automatically spawn pyspark-enabled Dataproc clusters. In these pyspark notebooks, the Spark version is 2.4.8. But by default Spark does not have the .avro datasource extension, and without it we cannot read .avro…
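
One route in, sketched here in Scala although the same spark.jars.packages key is available from pyspark's SparkSession.builder, is to request the package before the session (and its JVM) starts; the app name and artifact version are assumptions for a Scala 2.11 build of Spark 2.4.8.

    import org.apache.spark.sql.SparkSession

    // spark.jars.packages only takes effect if set before the session
    // starts, so configure it at build time rather than afterwards.
    val spark = SparkSession.builder()
      .appName("avro-notebook") // assumed name
      .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.8")
      .getOrCreate()
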
1 vote · 0 answers

Does from_avro in pyspark take the magic byte (4 bytes) of Avro byte data (from Kafka) into account?

I have streamed data in Avro format in Kafka storage, and I manage the schema of the data via the Confluent schema registry. I'd like to pull the data using pyspark and parse the Avro byte data using the schema from the schema registry, but it kept raising…
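
As far as I know, from_avro expects plain Avro bytes and does not understand Confluent's wire format, which prepends one magic byte plus a 4-byte schema id; stripping those five bytes first, as sketched below in Scala, is the usual workaround (kafkaDf and schemaJson stand in for the Kafka source DataFrame and the schema fetched from the registry).

    import org.apache.spark.sql.avro.functions.from_avro
    import org.apache.spark.sql.functions.{col, expr}

    // Confluent framing = 1 magic byte + 4-byte schema id, so the real
    // Avro payload starts at byte 6 (Spark SQL substring is 1-based).
    val parsed = kafkaDf
      .select(expr("substring(value, 6, length(value) - 5)").as("payload"))
      .select(from_avro(col("payload"), schemaJson).as("record"))
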
1 vote · 0 answers

Can't deserialize Avro file in Spark

There is some problem trying to deserialize data from an .avro file. My process consists of these steps: reading from Kafka: df = ( spark.read.format("kafka") .option("kafka.security.protocol", "PLAINTEXT") …
1 vote · 2 answers

Avro bytes from Event Hub cannot be deserialized with pyspark

We are sending Avro data encoded with azure.schemaregistry.encoder.avroencoder to Event Hub using a standalone Python job, and we can deserialize it with the same decoder in another standalone Python consumer. The schema registry is also supplied…
1 vote · 0 answers

Running spark-submit with spark-avro installed locally on a Mac or PC

I am really struggling with this one. I've spent a lot of time searching for an answer in the Spark manual and Stack Overflow posts, and really need help. I've installed Apache Spark on my Mac to build and debug PySpark code locally. However, in my PySpark code…
bda · 372
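
For local runs, the usual route is to pass the package on the command line rather than installing anything by hand, e.g. spark-submit --packages org.apache.spark:spark-avro_2.12:3.2.0 my_job.py, where the artifact version must match the local Spark/Scala build and my_job.py is a placeholder for your script.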