
I am new to Spark and am trying to load Avro data into a Spark Dataset (Spark 1.6) using Java. I have seen some examples in Scala but none in Java; any pointers to Java examples would be helpful. I tried creating a JavaRDD and then converting it to a Dataset, but I believe there must be a more straightforward way.

Shreeharsha
Pradeep
  • I actually faced this problem too and couldn't figure it out. I don't know how you are creating your RDDs, but I was receiving them from Kafka without knowing the schema. So to create a Dataset I had to change the format of the sent data: a JSON string instead of avro-serialized data. After that I simply used `session.read().json(JavaRDD);` (a minimal sketch of this route follows these comments). Or, if you still want to use avro, then I think the way is to put the data in an avro file and use `session.read().format("avro").load("avrofile.avro");` (not sure of the format string value though). I still hope there is some simpler way, so I will add the question to my favorites. – RadioLog Aug 22 '16 at 07:26
  • But maybe you'll find a suitable example here: http://spark.apache.org/docs/latest/sql-programming-guide.html. Just choose the Java tab. – RadioLog Aug 22 '16 at 07:31
  • I was able to read the avro data using `Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("users.avro");`, where users.avro is the data file and User.avsc is the schema I used. But I am not able to convert the Dataset<Row> to a Dataset<User>. I tried `Encoder<User> UserEncoder = Encoders.bean(User.class);` /* User.class is the avro-generated class */ and then `Dataset<User> df = spark.read().format("com.databricks.spark.avro").load("users.avro").as(UserEncoder);`. – Pradeep Aug 23 '16 at 18:11
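
For reference, here is a minimal sketch of the JSON route from the first comment. It assumes Spark 2.x; the class name, field names, and sample records are made up for illustration, and the JSON strings that would normally arrive from Kafka are simply parallelized in place.

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonRddToDataset {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .master("local[*]")
                .appName("json-rdd-to-dataset")
                .getOrCreate();

        JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());

        // Stand-in for the JSON strings that would normally arrive from Kafka.
        JavaRDD<String> jsonRdd = jsc.parallelize(Arrays.asList(
                "{\"name\":\"alice\",\"favorite_number\":7}",
                "{\"name\":\"bob\",\"favorite_number\":3}"));

        // Spark infers the schema from the JSON strings and returns a Dataset<Row>.
        Dataset<Row> ds = session.read().json(jsonRdd);
        ds.show();

        session.stop();
    }
}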

1 Answer


First of all, you need to set hadoop.home.dir (here it points at a local winutils install, which Hadoop requires on Windows):

System.setProperty("hadoop.home.dir", "C:/app/hadoopo273/winutils-master/hadoop-2.7.1");

Then create a SparkSession; in this example the configuration also sets a Cassandra connection host and a local warehouse directory:

SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("ASH")
        .config("spark.cassandra.connection.host", "127.0.0.1")
        .config("spark.sql.warehouse.dir", "file:///C:/cygwin64/home/a622520/dev/AshMiner2/cass-spark-embedded/cassspark/cassspark.all/spark-warehouse/")
        .getOrCreate();

In my code I am using an embedded Spark environment.

// Creates a DataFrame from a specified file
Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("./Ash.avro");
df.createOrReplaceTempView("words");
Dataset<Row> wordCountsDataFrame = spark.sql("select count(*) as total from words");
wordCountsDataFrame.show();
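
If you need a typed Dataset<User> rather than a Dataset<Row> (as asked in the comments above), one option is to map the rows onto a plain Java bean with Encoders.bean. This is only a sketch under assumptions: User here is a hand-written, hypothetical POJO whose property names match the columns in users.avro (the avro-generated class is often not compatible with the bean encoder), and the field names are invented.

import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Hypothetical POJO; its property names must match the column names in the avro file.
public class User implements Serializable {
    private String name;
    private Integer age;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Integer getAge() { return age; }
    public void setAge(Integer age) { this.age = age; }
}

// Elsewhere, reusing the SparkSession created above:
Dataset<Row> rows = spark.read().format("com.databricks.spark.avro").load("users.avro");
Dataset<User> users = rows.as(Encoders.bean(User.class));
users.show();

Note that as(...) resolves columns by name, so if the avro field names differ from the bean property names you would need to rename the columns first (for example with withColumnRenamed) before applying the encoder.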

Hope this helps.

– n2o