When saving an RDD to S3 in Avro format, I get the following warning in the console:
Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
I haven't been able to find a simple implicit such as saveAsAvroFile, so I dug around and came up with the following:
import org.apache.avro.Schema
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

object AvroUtil {

  def write[T](
      path: String,
      schema: Schema,
      avroRdd: RDD[T],
      job: Job = Job.getInstance()): Unit = {

    // Wrap each record in an AvroKey paired with NullWritable,
    // which is the shape AvroKeyOutputFormat expects.
    val intermediateRdd = avroRdd.mapPartitions(
      f = (iter: Iterator[T]) => iter.map(new AvroKey(_) -> NullWritable.get()),
      preservesPartitioning = true
    )

    // Enable Snappy-compressed output and register the writer schema on the job.
    job.getConfiguration.set("avro.output.codec", "snappy")
    job.getConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
    AvroJob.setOutputKeySchema(job, schema)

    intermediateRdd.saveAsNewAPIHadoopFile(
      path,
      classOf[AvroKey[T]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[T]],
      job.getConfiguration
    )
  }
}
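
For context, here is a minimal sketch of how it gets called; the schema, the records and the s3a path below are placeholders for illustration, not my real data:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.spark.{SparkConf, SparkContext}

object AvroUtilExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-write-example"))

    // Placeholder schema: a record with a single string field.
    val schemaJson =
      """{"type":"record","name":"Person","fields":[{"name":"name","type":"string"}]}"""

    // Build a few GenericRecords; the schema is parsed inside mapPartitions
    // so no Schema instance has to be captured in the closure.
    val records = sc.parallelize(Seq("alice", "bob")).mapPartitions { names =>
      val schema = new Schema.Parser().parse(schemaJson)
      names.map { n =>
        val rec: GenericRecord = new GenericData.Record(schema)
        rec.put("name", n)
        rec
      }
    }

    // Placeholder S3 destination.
    AvroUtil.write("s3a://my-bucket/output/people", new Schema.Parser().parse(schemaJson), records)
  }
}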
I'm rather baffled, as I don't see what is incorrect: the Avro files seem to be written out correctly.
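
As an aside, the kind of "simple implicit" I was hoping to find would look roughly like the sketch below, just delegating to the write method above (the names here are made up):

import org.apache.avro.Schema
import org.apache.spark.rdd.RDD

object AvroSyntax {
  // Hypothetical extension adding saveAsAvroFile to any RDD.
  implicit class AvroRddOps[T](rdd: RDD[T]) {
    def saveAsAvroFile(path: String, schema: Schema): Unit =
      AvroUtil.write(path, schema, rdd)
  }
}

// Usage: import AvroSyntax._ and then
//   records.saveAsAvroFile("s3a://my-bucket/output/people", schema)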