I have some legacy data in S3 that I want to convert to Parquet format with Spark 2, using the Java API.
I have the desired Avro schemas (.avsc files) and their Java classes generated with the Avro compiler, and I want to store the data in Parquet format using those schemas. The input data is not in any standard format, but I have a library that can convert each line of the legacy files into Avro classes.
Is it possible to read the data as a JavaRDD&lt;String&gt;, then apply the conversion to the Avro classes using the library, and finally store the result in Parquet format?
Something like:
JavaRDD<String> rdd = javaSparkContext.textFile("s3://bucket/path_to_legacy_files");
JavaRDD<MyAvroClass> converted = rdd.map(line -> customLib.convertToAvro(line));
converted.saveAsParquet("s3://bucket/destination"); // pseudocode: how do I do this?
Is something like the above feasible? I would later want to process the converted Parquet data using Hive and Presto as well as Spark.
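For the missing save step, one approach I'm considering (a sketch only, not tested) is to go through Hadoop's new-API output formats: pair each record with a null Void key and write via parquet-avro's AvroParquetOutputFormat. MyAvroClass and converted are from my code above; the rest assumes the parquet-avro artifact is on the classpath:

import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Carry the Avro schema to the output format; getClassSchema() is
// generated on every Avro-compiled class.
Job job = Job.getInstance(javaSparkContext.hadoopConfiguration());
AvroParquetOutputFormat.setSchema(job, MyAvroClass.getClassSchema());

// Hadoop output formats expect key/value pairs; use a null Void key.
JavaPairRDD<Void, MyAvroClass> pairs =
        converted.mapToPair(record -> new Tuple2<>(null, record));

pairs.saveAsNewAPIHadoopFile(
        "s3://bucket/destination",   // output path
        Void.class,                  // key class (unused)
        MyAvroClass.class,           // value class
        AvroParquetOutputFormat.class,
        job.getConfiguration());

Since Hive and Presto only need standard Parquet files on S3, I assume the output written this way would be readable from all three engines, but I'm not sure whether this is the idiomatic way to do it in Spark 2 or whether I should instead convert the RDD to a Dataset and use its Parquet writer.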