
Can somebody share an example of reading Avro using Java in Spark? I found Scala examples but have had no luck with Java. Here is the snippet from my code that runs into compilation issues at the call to ctx.newAPIHadoopFile:

JavaSparkContext ctx = new JavaSparkContext(sparkConf);
Configuration hadoopConf = new Configuration();
JavaRDD<SampleAvro> lines = ctx.newAPIHadoopFile(path, AvroInputFormat.class, AvroKey.class, NullWritable.class, new Configuration());

Regards

  • Could you please share more information about the compilation issues you are running into? Errors, stack trace, etc. – Jordan Pilat Jan 25 '16 at 20:33
  • It's giving a compilation error saying that java.lang.Class was expected but AvroInputFormat.class was the actual argument, and the same for the rest of the arguments except path and hadoopConf. Any idea where I am going wrong? Thanks – kre Jan 26 '16 at 11:20
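
For what it's worth, the mismatch is that org.apache.avro.mapred.AvroInputFormat implements the old mapred InputFormat API, while newAPIHadoopFile requires an org.apache.hadoop.mapreduce InputFormat such as AvroKeyInputFormat, and it returns a JavaPairRDD rather than a JavaRDD. A minimal sketch of a call that compiles, assuming GenericRecord as the record type:

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.api.java.JavaPairRDD;

// newAPIHadoopFile returns a pair RDD keyed by AvroKey; the raw class
// literals make this an unchecked call, so some compilers may want an
// explicit cast of the class tokens
JavaPairRDD<AvroKey<GenericRecord>, NullWritable> records =
    ctx.newAPIHadoopFile(path,
        AvroKeyInputFormat.class,  // org.apache.avro.mapreduce, not .mapred
        AvroKey.class,
        NullWritable.class,
        new Configuration());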

2 Answers


You can use the spark-avro connector library by Databricks.
The recommended way to read or write Avro data from Spark SQL is by using Spark's DataFrame APIs.

The connector enables both reading and writing Avro data from Spark SQL:

import org.apache.spark.sql.*;

SQLContext sqlContext = new SQLContext(sc);

// Creates a DataFrame from a specified file
DataFrame df = sqlContext.read().format("com.databricks.spark.avro")
    .load("src/test/resources/episodes.avro");

// Saves the subset of the Avro records read in
df.filter(df.col("age").gt(5)).write()
    .format("com.databricks.spark.avro")
    .save("/tmp/output");

Note that this connector has different versions for Spark 1.2, 1.3, and 1.4+:

Spark version    spark-avro version
1.2              0.2.0
1.3              1.0.0
1.4+             2.0.1

Using Maven:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.10</artifactId>
    <version>{AVRO_CONNECTOR_VERSION}</version>
</dependency>

See further info at: Spark SQL Avro Library

  • Any idea how one might do it via the Hadoop InputFormat API, in Java? – Jordan Pilat Jan 25 '16 at 22:09
  • @Jordan - Try this: http://stackoverflow.com/questions/5480308/getting-started-with-avro – Leet-Falcon Jan 26 '16 at 06:20
  • @Jordan - And I think this: https://github.com/apache/avro/tree/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce – Leet-Falcon Jan 26 '16 at 06:21
  • @Leet-Falcon Thanks, I have already tried Spark SQL with Avro but no luck so far. Below is the error message I am getting: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.read()Lorg/apache/spark/sql/DataFrameReader; at org.opencb.hpg.bigdata.tools.sparkanalytics.SaprkSQLAvro.main(SaprkSQLAvro.java:19) – kre Jan 26 '16 at 11:36
  • @kre - Do you use Spark 1.4+ ? – Leet-Falcon Jan 26 '16 at 12:26
  • @Leet-Falcon yes. These are my pom dependencies: org.apache.spark:spark-core_2.10:1.6.0, org.apache.spark:spark-sql_2.10:1.4.0, com.databricks:spark-avro_2.10:2.0.1 – kre Jan 26 '16 at 15:19
  • @kre - Spark-core & spark-sql should have the same version. spark-avro is ok – Leet-Falcon Jan 27 '16 at 10:30
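
For reference, aligning the versions as suggested would make the Spark dependencies look like this (a sketch; spark-sql bumped to 1.6.0 to match the asker's spark-core):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <!-- must match the spark-core version -->
    <version>1.6.0</version>
</dependency>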

Here, assuming K is your Key and V is your value:

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyValueInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

// Configure a Hadoop Job to carry the Avro input settings
Job job = Job.getInstance();

job.setInputFormatClass(AvroKeyValueInputFormat.class);

FileInputFormat.addInputPaths(job, <inputPaths>);
AvroJob.setInputKeySchema(job, <keySchema>);
AvroJob.setInputValueSchema(job, <valueSchema>);

// sc is the JavaSparkContext; the raw class literals make this an
// unchecked call, so the assignment compiles with a warning
JavaPairRDD<AvroKey<K>, AvroValue<V>> avroRDD =
    sc.newAPIHadoopRDD(job.getConfiguration(),
        AvroKeyValueInputFormat.class,
        AvroKey.class,
        AvroValue.class);
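
A typical next step (a sketch, not part of the original answer) is to unwrap the AvroKey/AvroValue wrappers into their underlying datums:

import scala.Tuple2;

// datum() returns the deserialized Avro object inside each wrapper
JavaPairRDD<K, V> datums = avroRDD.mapToPair(
    pair -> new Tuple2<>(pair._1().datum(), pair._2().datum()));

Note that Hadoop record readers may reuse the underlying objects, so copy the datums before caching or collecting the RDD.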