I am new to Spark and am trying to load Avro data into a Spark Dataset (Spark 1.6) using Java. I see some examples in Scala but not in Java. Any pointers to examples in Java would be helpful. I tried to create a JavaRDD and then convert it to a Dataset, but I believe there must be a more straightforward way.
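For reference, here is a minimal sketch of the JavaRDD-to-Dataset route mentioned above, written against the Spark 2.x SparkSession API that the comments and answer below use (in Spark 1.6 the analogous, experimental methods live on SQLContext); the User bean here is purely illustrative, not an actual Avro-generated class:

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RddToDatasetSketch {

    // Hypothetical bean standing in for the Avro-generated record class
    public static class User implements Serializable {
        private String name;
        private int age;
        public User() { }
        public User(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("rdd-to-dataset")
                .getOrCreate();

        // Pretend these beans were deserialized from Avro records
        List<User> users = Arrays.asList(new User("alice", 30), new User("bob", 25));
        JavaRDD<User> userRdd =
                new JavaSparkContext(spark.sparkContext()).parallelize(users);

        // Untyped route: DataFrame built from the RDD, using the bean class for the schema
        Dataset<Row> df = spark.createDataFrame(userRdd, User.class);
        df.show();

        // Typed route: Dataset<User> built directly with a bean encoder
        Dataset<User> ds = spark.createDataset(userRdd.rdd(), Encoders.bean(User.class));
        ds.show();

        spark.stop();
    }
}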
- I also faced this problem, actually, and I couldn't figure it out. I don't know how you are creating the RDDs, but I was receiving them from Kafka without knowing the schema, so to create a Dataset I had to change the format of the sent data: a JSON string instead of avro-serialized data. After that I simply used `session.read().json(JavaRDD);`. Or if you still want to use avro, then I think the way is to put it in an avro file and use `session.read().format("avro").load("avrofile.avro");` (not sure of the format string value, though). I still hope there is some simpler way, so I will add the question to my favorites. – RadioLog Aug 22 '16 at 07:26
- But maybe you'll find an example that suits you here: http://spark.apache.org/docs/latest/sql-programming-guide.html. Just choose the Java tab. – RadioLog Aug 22 '16 at 07:31
- I was able to read avro data using `Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("users.avro");`, where users.avro is the data file and User.avsc is the schema that I used. But I am not able to convert the Dataset<Row> to a Dataset<User>. I tried `Encoder<User> UserEncoder = Encoders.bean(User.class); /* User.class is the avro-generated class */ Dataset<User> df = spark.read().format("com.databricks.spark.avro").load("users.avro").as(UserEncoder);` – Pradeep Aug 23 '16 at 18:11
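For what it's worth, a minimal sketch of the conversion described in the last comment, assuming a plain Java bean `User` whose property names match the Avro columns; the Avro-generated class carries non-bean members (such as its Schema), which may be why `Encoders.bean` failed with it:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().master("local").appName("avro-typed").getOrCreate();

// Untyped read of the Avro file (needs the external spark-avro package on the classpath)
Dataset<Row> rows = spark.read()
        .format("com.databricks.spark.avro")
        .load("users.avro");

// Re-interpret the rows as a typed Dataset; column names must match the bean's properties
Encoder<User> userEncoder = Encoders.bean(User.class);
Dataset<User> typedUsers = rows.as(userEncoder);
typedUsers.show();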
1 Answer
First of all, you need to set hadoop.home.dir:
System.setProperty("hadoop.home.dir", "C:/app/hadoopo273/winutils-master/hadoop-2.7.1");
Then create a SparkSession (the configuration below is for my local environment):
SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("ASH")
        .config("spark.cassandra.connection.host", "127.0.0.1")
        .config("spark.sql.warehouse.dir", "file:///C:/cygwin64/home/a622520/dev/AshMiner2/cass-spark-embedded/cassspark/cassspark.all/spark-warehouse/")
        .getOrCreate();
In my code I am using an embedded Spark environment.
// Creates a DataFrame from a specified file
Dataset<Row> df = spark.read()
        .format("com.databricks.spark.avro")
        .load("./Ash.avro");
df.createOrReplaceTempView("words");
Dataset<Row> wordCountsDataFrame = spark.sql("select count(*) as total from words");
wordCountsDataFrame.show();
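Note that the com.databricks.spark.avro format comes from the external spark-avro package, which must be on the classpath. If you also need to write results back out, here is a small sketch continuing from wordCountsDataFrame above (the output path is just illustrative):

// Write the aggregated result back out as Avro using the same external format
wordCountsDataFrame.write()
        .format("com.databricks.spark.avro")
        .save("./ash-counts.avro");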
Hope this helps.
