I have encrypted data in Avro format with the following schema:
{"type":"record","name":"ProtectionWrapper","namespace":"com.security","fields":
[{"name":"protectionInfo","type":["null",{"type":"record","name":"ProtectionInfo","fields":
[{"name":"unprotected","type":"boolean"}]}]}],
"writerSchema":"{"type":"record","name":"Demo","namespace":"com.demo","fields":
[{"name":"id","type":"string"}]}"}
Here "writerSchema" is the schema of data before encryption. The data has to be written with the writer schema so that the decrypt function uses it while decrypting. When I use the below code, writer schema is written along with data.
// javaSparkContext is the JavaSparkContext for this application
Job mrJob = org.apache.hadoop.mapreduce.Job.getInstance(javaSparkContext.hadoopConfiguration());
AvroJob.setDataModelClass(mrJob, SpecificData.class);
AvroJob.setOutputKeySchema(mrJob, protectionSchema); // ProtectionWrapper schema shown above
JavaPairRDD<AvroKey<GenericRecord>, NullWritable> encryptedData = encryptionMethod();
encryptedData.saveAsNewAPIHadoopFile("c:\\test", AvroKey.class, NullWritable.class,
        AvroKeyOutputFormat.class, mrJob.getConfiguration());
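With the file written this way, the wrapper schema (including the custom property) travels with the data, so the decrypt side can get the original schema back from it. A minimal sketch, assuming plain Avro APIs and that fileSchema is the ProtectionWrapper schema read back from the written file:

// The custom "writerSchema" property holds the pre-encryption schema as a JSON string.
String writerJson = fileSchema.getProp("writerSchema");
Schema writerSchema = new Schema.Parser().parse(writerJson); // com.demo.Demo
// writerSchema can then be handed to the decrypt function.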
But when I convert the schema to a StructType and write with Spark instead, the writerSchema property does not get written along with the data.
StructType type = (StructType) SchemaConverters.toSqlType(protectionSchema).dataType();
Dataset<Row> ds = sparkSession.createDataFrame(rdd, type); // rdd holds the encrypted rows
ds.write().format("avro").save("c:\\test");
Is it possible to achieve the same using a Spark write, without having to use the saveAsNewAPIHadoopFile() method?
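For reference, the kind of Spark-native write I am hoping for would look roughly like the sketch below (assuming Spark 2.4+ with the built-in Avro source, which documents an "avroSchema" option for supplying a user-defined output schema; I don't know whether a custom property like "writerSchema" survives this path):

// Hypothetical: hand the full ProtectionWrapper schema, including the custom
// writerSchema property, to the Avro writer instead of letting Spark derive
// a schema from the StructType.
ds.write()
  .format("avro")
  .option("avroSchema", protectionSchema.toString())
  .save("c:\\test");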