
I am new to Spark and am trying to load Avro data into a Spark Dataset (Spark 1.6) using Java. I have seen some examples in Scala but none in Java; any pointers to Java examples would be helpful. I tried creating a JavaRDD and then converting it to a Dataset, but I believe there must be a more straightforward way.

Shreeharsha
Pradeep
  • I actually faced this problem too and couldn't figure it out. I don't know how you are creating your RDDs, but I was receiving them from Kafka without knowing the schema. So to create a Dataset I had to change the format of the sent data: a JSON string instead of avro-serialized data. After that I simply used `session.read().json(JavaRDD);` (a minimal sketch of this route follows these comments). Or, if you still want to use avro, then I think the way is to put the data in an avro file and use `session.read().format("avro").load("avrofile.avro");` (not sure of the format string value though). I still hope there is some simpler way, so I will add the question to my favorites. – RadioLog Aug 22 '16 at 07:26
  • But maybe you'll find a suitable example here: http://spark.apache.org/docs/latest/sql-programming-guide.html. Just choose the Java tab. – RadioLog Aug 22 '16 at 07:31
  • I was able to read the avro data using `Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("users.avro");`, where users.avro is the data file and User.avsc is the schema I used. But I am not able to convert the Dataset<Row> to a Dataset<User>. I tried `Encoder<User> UserEncoder = Encoders.bean(User.class);` /* User.class is the avro-generated class */ and then `Dataset<User> df = spark.read().format("com.databricks.spark.avro").load("users.avro").as(UserEncoder);`. – Pradeep Aug 23 '16 at 18:11
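
For reference, here is a minimal sketch of the JSON route from the first comment. It assumes Spark 2.x; the class name, field names, and sample records are made up for illustration, and the JSON strings that would normally arrive from Kafka are simply parallelized in place.

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonRddToDataset {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .master("local[*]")
                .appName("json-rdd-to-dataset")
                .getOrCreate();

        JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());

        // Stand-in for the JSON strings that would normally arrive from Kafka.
        JavaRDD<String> jsonRdd = jsc.parallelize(Arrays.asList(
                "{\"name\":\"alice\",\"favorite_number\":7}",
                "{\"name\":\"bob\",\"favorite_number\":3}"));

        // Spark infers the schema from the JSON strings and returns a Dataset<Row>.
        Dataset<Row> ds = session.read().json(jsonRdd);
        ds.show();

        session.stop();
    }
}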

1 Answer


First of all, you need to set hadoop.home.dir (here it points at a local winutils install, which Hadoop requires on Windows):

System.setProperty("hadoop.home.dir", "C:/app/hadoopo273/winutils-master/hadoop-2.7.1");

Then create a SparkSession; in this example the configuration also sets a Cassandra connection host and a local warehouse directory:

SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("ASH")
        .config("spark.cassandra.connection.host", "127.0.0.1")
        .config("spark.sql.warehouse.dir", "file:///C:/cygwin64/home/a622520/dev/AshMiner2/cass-spark-embedded/cassspark/cassspark.all/spark-warehouse/")
        .getOrCreate();

In my code I am using an embedded Spark environment.

// Creates a DataFrame from a specified file
Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("./Ash.avro");
df.createOrReplaceTempView("words");
Dataset<Row> wordCountsDataFrame = spark.sql("select count(*) as total from words");
wordCountsDataFrame.show();
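
If you need a typed Dataset<User> rather than a Dataset<Row> (as asked in the comments above), one option is to map the rows onto a plain Java bean with Encoders.bean. This is only a sketch under assumptions: User here is a hand-written, hypothetical POJO whose property names match the columns in users.avro (the avro-generated class is often not compatible with the bean encoder), and the field names are invented.

import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Hypothetical POJO; its property names must match the column names in the avro file.
public class User implements Serializable {
    private String name;
    private Integer age;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Integer getAge() { return age; }
    public void setAge(Integer age) { this.age = age; }
}

// Elsewhere, reusing the SparkSession created above:
Dataset<Row> rows = spark.read().format("com.databricks.spark.avro").load("users.avro");
Dataset<User> users = rows.as(Encoders.bean(User.class));
users.show();

Note that as(...) resolves columns by name, so if the avro field names differ from the bean property names you would need to rename the columns first (for example with withColumnRenamed) before applying the encoder.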

Hope this helps.

– n2o