Questions tagged [spark-avro]

A library for reading and writing Avro data from Spark SQL.

The GitHub page is here.

227 questions
1
vote
1 answer

How to convert bytes column (with logicaltype as decimal) in Avro to decimal?

I have a decimal column "TOT_AMT" defined as type "bytes" and logical type "decimal" in my avro schema. After creating the data frame in spark using databricks spark-avro, when I tried to sum the TOT_AMT column using the sum function it throws…
Anand B
  • 57
  • 3
  • 8
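
For readers hitting the same problem, here is a minimal sketch of the decoding step itself, outside Spark. The Avro specification stores a `decimal` logical-type value as the two's-complement, big-endian bytes of its unscaled integer, with the scale declared in the schema. The helper name `avro_decimal_from_bytes` and the scale of 2 are assumptions for illustration, not taken from the question.

```python
from decimal import Decimal


def avro_decimal_from_bytes(raw: bytes, scale: int) -> Decimal:
    """Decode one Avro `decimal` logical-type value stored as `bytes`.

    Hypothetical helper: `scale` must come from the Avro schema of the column.
    """
    # The bytes hold the unscaled value as a signed, big-endian integer.
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # Move the decimal point left by `scale` digits
    # (the result is subject to the active decimal context precision).
    return Decimal(unscaled).scaleb(-scale)


# Example: the unscaled integer 12345 with scale 2 (assumed values).
print(avro_decimal_from_bytes((12345).to_bytes(2, "big", signed=True), 2))
```

In Spark this logic could be wrapped in a UDF over the bytes column, after which the usual sum aggregation applies to the decoded values.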
1
vote
1 answer

Enabling Compression on Avro via PySpark

Using PySpark I'm trying to save an Avro file with compression (preferably snappy). This line of code successfully saves a 264MB file: df.write.mode('overwrite').format('com.databricks.spark.avro').save('s3n://%s:%s@%s/%s' % (access_key,…
Frank B.
  • 1,813
  • 5
  • 24
  • 44
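
One possible approach, sketched under assumptions: spark-avro picks up its codec from the SQL configuration key `spark.sql.avro.compression.codec`, so setting it before the write should produce a compressed file. The helper name `write_avro_compressed` is hypothetical, and the sketch assumes a Spark 2.x `SparkSession`; on Spark 1.x the equivalent call would be `sqlContext.setConf(...)`.

```python
def write_avro_compressed(spark, df, path, codec="snappy"):
    """Write `df` as Avro with the given codec (hypothetical helper).

    `spark` is assumed to be a SparkSession and `df` a DataFrame;
    `path` is whatever output location the job already uses.
    """
    # spark-avro reads the codec from this configuration key.
    spark.conf.set("spark.sql.avro.compression.codec", codec)
    # Same write call as in the question, with the path passed in.
    df.write.mode("overwrite").format("com.databricks.spark.avro").save(path)
```

The codec is set on the session rather than per write, so it stays in effect for later Avro writes in the same session unless it is changed again.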
1
vote
0 answers

StackOverflowError while loading Avro file to create a Dataframe

I am running into this error when trying to load an Avro file (size 134 KB). My pom dependencies are below. I am creating this Avro file from a protobuf message, which works fine. pom dependencies…
Nitin Kumar
  • 219
  • 2
  • 10
1
vote
0 answers

How to read a large Avro file

I am trying to read a large Avro file (2 GB) using spark-shell, but I am getting a StackOverflowError. val newDataDF = spark.read.format("com.databricks.spark.avro").load("abc.avro") java.lang.StackOverflowError at…
PrinceChamp
  • 41
  • 1
  • 3
1
vote
0 answers

Not able to convert the byte[] to string in scala

I'm trying to stream data from Kafka and convert it into a DataFrame. I followed this link, but when I run both the producer and consumer applications, this is the output on my console: (0,[B@370ed56a) (1,[B@2edd3e63) (2,[B@3ba2944d)…
1
vote
0 answers

Avro - code-generation approach vs non-code generation approach

I'm new to Avro. The official documentation indicates that there are two possible approaches to using Avro. With code generation: here, classes are auto-generated from Avro schema files by the Avro compiler. These classes are then used in the…
jithinpt
  • 1,204
  • 2
  • 16
  • 33
1
vote
0 answers

Reading a BigQuery table with AvroBigQueryInputFormat from Spark gives unexpected behavior (using Java)

A sample code skeleton is roughly as follows, where I am basically reading an RDD from BigQuery and selecting all data points where the my_field_name value is null. JavaPairRDD input = sc …
Xinwei Liu
  • 333
  • 6
  • 15
1
vote
1 answer

Pyspark + Hive avro table

I created a Hive Avro table and am trying to read it from PySpark. Basically, I am trying to run a basic query over this Hive Avro table in PySpark in order to do some analysis. from pyspark import SparkContext from pyspark.sql import HiveContext hive_context…
SuWon
  • 23
  • 1
  • 7
1
vote
1 answer

Read avro data using spark dataset in java

I am a newbie to Spark and am trying to load Avro data into a Spark Dataset (Spark 1.6) using Java. I see some examples in Scala but not in Java. Any pointers to examples in Java would be helpful. I tried to create a JavaRDD and then convert it to…
Pradeep
  • 850
  • 2
  • 14
  • 27
1
vote
2 answers

Bootstrapping spark-avro jar to Amazon EMR cluster

I want to read Avro files located in Amazon S3 from a Zeppelin notebook. I understand Databricks has a wonderful package for this, spark-avro. What steps do I need to take in order to bootstrap this jar file to my cluster and make it…
van_d39
  • 725
  • 2
  • 14
  • 28
1
vote
1 answer

Spark changes the schema when writing to Avro

I have a Spark job (in CDH 5.5.1) that loads two Avro files (both with the same schema), combines them to make a DataFrame (also with the same schema) then writes them back out to Avro. The job explicitly compares the two input schemas to ensure…
DNA
  • 42,007
  • 12
  • 107
  • 146
1
vote
2 answers

NoClassDefFoundError when using avro in spark-shell

I keep getting java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema…
Pudge
  • 98
  • 1
  • 6
1
vote
2 answers

How to serialize the data to AVRO schema in Spark (with Java)?

I have defined an Avro schema and generated some classes from it with avro-tools. Now I want to serialize the data to disk. I found some answers about this for Scala, but not for Java. The class Article is generated with avro-tools, and is…
Belphegor
  • 4,456
  • 11
  • 34
  • 59
1
vote
1 answer

java.lang.NoClassDefFoundError: com/databricks/spark/avro/package$

I am using Spark 1.3.0 and spark-avro 1.0.0. My build.sbt file looks like this: libraryDependencies ++= Seq( "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided", "org.apache.spark" % "spark-sql_2.10" % "1.5.2" % "provided", "com.databricks"…
Knows Not Much
  • 30,395
  • 60
  • 197
  • 373
0
votes
0 answers

Use Avro model from different package in different repository

I have an uncommon problem. I have repository X, which contains an Avro model called Person. In my repository Y, I would like to create a new model with a property of type Person from repository X. Is it even possible? I have imported the X artifact into Y, but it…
Mati
  • 1
  • 1