
I have some legacy data in S3 which I want to convert to Parquet format with Spark 2, using the Java API.

I have the desired Avro schemas (.avsc files) and the Java classes generated from them with the Avro compiler, and I want to store the data in Parquet format using those schemas. The input data is not in any standard format, but I have a library that can convert each line of the legacy files into an instance of the generated Avro classes.

Is it possible to read the data as a JavaRDD<String>, then apply the conversion to the Avro classes using the library, and finally store the result in Parquet format?

Something like:

JavaRDD<String> rdd = javaSparkContext.textFile("s3://bucket/path_to_legacy_files");    
JavaRDD<MyAvroClass> converted = rdd.map(line -> customLib.convertToAvro(line));    
converted.saveAsParquet("s3://bucket/destination"); //how do I do this

Is something like the above feasible? I would later want to process the converted parquet data using Hive, Presto as well as Spark.

Swaranga Sarma
  • Search for the Spark Summit presentation by Steve Loughran (Hortonworks) about "object stores"... – Samson Scharfrichter Jan 18 '17 at 08:43
  • @SamsonScharfrichter Doesn't answer my question. The only remotely related stuff I saw was how he converted some CSV data into Parquet. He uses the sparkSession.csv() call to load the data, which I cannot since I need to use a custom deserializer. – Swaranga Sarma Jan 18 '17 at 13:17
  • So, what is your **actual** question? Is it about converting a custom `JavaRDD` to a regular DataFrame? About saving your custom stuff into Parquet format? About saving that to S3 object storage? About a way to read back your custom stuff with another tool that has no idea what a RDD is? A combination of the above? – Samson Scharfrichter Jan 18 '17 at 17:05
  • @SamsonScharfrichter The question is basically how do I convert some non-standard data to Parquet. I have, at my disposal, a Spark 2.0 cluster, Avro schema definitions, and a Java library that can convert the records from the legacy non-standard format to an instance of the Avro class. The code snippet was just a thought, asking whether something like that could be done. – Swaranga Sarma Jan 18 '17 at 20:02

2 Answers


Ignore S3 for now; that's a production detail. You need to start with the simpler problem: "convert a local file in my format to a standard one". This is something you can implement and unit-test locally against a small sample set of the data.

This is generally the same in Spark as in Hadoop MapReduce: implement a subclass of InputFormat<K, V> or FileInputFormat<K, V>, or use Hadoop's org.apache.hadoop.streaming.mapreduce.StreamInputFormat and implement your own RecordReader, then set the option spark.hadoop.stream.recordreader.class to the class name of your record reader (probably the easiest route).

There's lots of documentation on this, as well as Stack Overflow questions, and plenty of examples in the source trees themselves.
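
A minimal sketch of what that could look like, under the assumption that the per-line parsing is delegated to the asker's conversion library; LegacyInputFormat, LegacyRecordReader and CustomLib.convertToAvro are hypothetical names standing in for the real classes:

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical input format: splits files like a text input, but hands back
    // already-converted Avro records instead of raw lines.
    public class LegacyInputFormat extends FileInputFormat<NullWritable, MyAvroClass> {
        @Override
        public RecordReader<NullWritable, MyAvroClass> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new LegacyRecordReader();
        }
    }

    // Hypothetical record reader: reuses Hadoop's LineRecordReader for line splitting
    // and delegates the parsing of each line to the conversion library.
    class LegacyRecordReader extends RecordReader<NullWritable, MyAvroClass> {
        private final LineRecordReader lines = new LineRecordReader();
        private MyAvroClass current;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            lines.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!lines.nextKeyValue()) {
                return false;
            }
            current = CustomLib.convertToAvro(lines.getCurrentValue().toString());
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public MyAvroClass getCurrentValue() { return current; }

        @Override
        public float getProgress() throws IOException { return lines.getProgress(); }

        @Override
        public void close() throws IOException { lines.close(); }
    }

From Spark, the converted records then come straight out of the input format:

    import org.apache.spark.api.java.JavaPairRDD;

    JavaPairRDD<NullWritable, MyAvroClass> records = javaSparkContext.newAPIHadoopFile(
        "s3://bucket/path_to_legacy_files",
        LegacyInputFormat.class,
        NullWritable.class,
        MyAvroClass.class,
        javaSparkContext.hadoopConfiguration());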

stevel

Figured it out. It's basically the approach mentioned by Steve, except that the required Hadoop input and output formats already exist:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.parquet.avro.AvroParquetOutputFormat;
    import org.apache.parquet.avro.AvroWriteSupport;
    import org.apache.parquet.hadoop.ParquetOutputFormat;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;
    import scala.Tuple2;
    // (older parquet-avro releases use the parquet.* package names instead of org.apache.parquet.*)

    // Configure the Parquet output format to use Avro write support and my Avro schema.
    Job job = new Job();
    ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
    AvroParquetOutputFormat.setSchema(job, MyAvroType.SCHEMA$);
    AvroParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);
    AvroParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
    AvroParquetOutputFormat.setCompressOutput(job, true);

    // Read the legacy files, convert each line to an Avro record with the custom library,
    // then write the (null-keyed) records out as Parquet via the Hadoop output format.
    sparkContext.textFile("s3://bucket/path_to_legacy_files")
        .map(line -> customLib.convertToAvro(line))
        .mapToPair(record -> new Tuple2<Void, MyAvroType>(null, record))
        .saveAsNewAPIHadoopFile(
            "s3://bucket/destination",
            Void.class,
            MyAvroType.class,
            new ParquetOutputFormat<MyAvroType>().getClass(),
            job.getConfiguration());
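
A quick way to sanity-check the result, assuming a SparkSession named spark is available, is to read the Parquet files back with the DataFrame API and inspect the schema Spark infers; the same files can then be exposed to Hive and Presto as external Parquet tables.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Read the converted data back and inspect the inferred schema and a few rows.
    Dataset<Row> result = spark.read().parquet("s3://bucket/destination");
    result.printSchema();
    result.show(10);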
Swaranga Sarma