When saving an RDD to S3 as Avro, I get the following warning in the console:

Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.

I haven't been able to find a simple implicit such as saveAsAvroFile, so I dug around and came up with this:

import org.apache.avro.Schema
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

object AvroUtil {

  def write[T](
      path: String,
      schema: Schema,
      avroRdd: RDD[T],
      job: Job = Job.getInstance()): Unit = {
    val intermediateRdd = avroRdd.mapPartitions(
      f = (iter: Iterator[T]) => iter.map(new AvroKey(_) -> NullWritable.get()),
      preservesPartitioning = true
    )

    job.getConfiguration.set("avro.output.codec", "snappy")
    job.getConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")

    AvroJob.setOutputKeySchema(job, schema)

    intermediateRdd.saveAsNewAPIHadoopFile(
      path,
      classOf[AvroKey[T]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[T]],
      job.getConfiguration
    )
  }
}

I'm rather baffled, as I can't see what is incorrect: the Avro files appear to be written correctly.

Mridang Agarwalla
  • Why not write a Dataframe with the spark-avro library? – OneCricketeer Apr 07 '21 at 01:25
  • @OneCricketeer are you referring to this? https://github.com/databricks/spark-avro It seems to be flagged as deprecated. Our codebase is reliant on low-level RDDs. Any chance you could post an example, please? Thx. – Mridang Agarwalla Apr 07 '21 at 16:47
  • Yes. That library was merged upstream (https://spark.apache.org/docs/latest/sql-data-sources-avro.html), and you'd need to convert your RDD using the toDF function: https://stackoverflow.com/questions/38968351/spark-2-0-scala-rdd-todf – OneCricketeer Apr 08 '21 at 02:01
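
A minimal sketch of the DataFrame route suggested in the comments above, assuming Spark 2.4 or later with the external spark-avro module on the classpath. The `Event` case class and the `AvroDataFrameWrite` object are hypothetical names used only for illustration. Note that this changes how the files are written, but does not by itself change which output committer is used.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical record type; replace with your own case class.
final case class Event(id: Long, name: String)

object AvroDataFrameWrite {

  def write(spark: SparkSession, rdd: RDD[Event], path: String): Unit = {
    // Brings toDF into scope for RDDs of case classes.
    import spark.implicits._

    rdd.toDF()
      .write
      .format("avro")                  // provided by the spark-avro module
      .option("compression", "snappy") // same codec as in the question
      .save(path)
  }
}
```

The case class is declared at the top level so that Spark can derive an encoder for it; declaring it inside the method would typically fail at compile time.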

1 Answer

You can override the behaviour of the existing FileOutputCommitter by implementing your own OutputCommitter to make commits more efficient and safer.

Follow this link, where the author explains a similar approach with an example.
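
Before writing a committer from scratch, it may be worth trying one of the S3A committers that ship with Hadoop. The following is a sketch, not a verified fix: it assumes Hadoop 3.1 or later with hadoop-aws on the classpath and an s3a:// output path. The `S3ACommitterConf` object and its `useS3ACommitter` helper are hypothetical; the property names come from the Hadoop S3A committer documentation.

```scala
import org.apache.hadoop.conf.Configuration

object S3ACommitterConf {

  // Applies S3A committer settings to a Hadoop configuration,
  // for example the job.getConfiguration used in the question.
  def useS3ACommitter(conf: Configuration, name: String = "directory"): Configuration = {
    // Route s3a:// output paths through the S3A committer factory.
    conf.set(
      "mapreduce.outputcommitter.factory.scheme.s3a",
      "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    // Documented values include "directory", "partitioned", "magic" and "file";
    // "file" falls back to the classic FileOutputCommitter.
    conf.set("fs.s3a.committer.name", name)
    conf
  }
}
```

With the code from the question, this would be called as `S3ACommitterConf.useS3ACommitter(job.getConfiguration)` before `saveAsNewAPIHadoopFile`. Whether the warning goes away depends on your Hadoop and Spark versions, so treat it as something to test rather than a guaranteed fix.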

saurabh2208