
I am trying to read Avro files from S3, and as shown in this Spark documentation I am able to read them fine. My files are like below; each file contains 5,000 records.

s3a://bucket/part-0.avro
s3a://bucket/part-1.avro
s3a://bucket/part-2.avro

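For reference, the documented read I am referring to looks roughly like this (a sketch, assuming the spark-avro package is on the classpath), and it loads these files without trouble:

import org.apache.spark.sql.DataFrame

// Standard spark-avro DataFrame read (sketch; requires the spark-avro package)
val avroDF: DataFrame = spark.read.format("avro").load(s"$s3URL/*.avro")
avroDF.count()   // should be 15000 for the three files above

However, for my use case I am reading the raw bytes myself: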
import org.apache.spark.rdd.RDD

val byteRDD: RDD[Array[Byte]] = sc.binaryFiles(s"$s3URL/*.avro").map { case (file, pds) => {
  // Read each file fully into a byte array
  val dis = pds.open()
  val len = dis.available()
  val buf = Array.ofDim[Byte](len)
  dis.readFully(buf)
  dis.close()
  buf
}}

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

val deserialisedAvroRDD = byteRDD.map(record => {

  val schema = new Schema.Parser().parse(schemaJson)
  val datumReader = new GenericDatumReader[GenericRecord](schema)

  val decoder = DecoderFactory.get.binaryDecoder(record, null)
  var datum: GenericRecord = null
  while (!decoder.isEnd()) {
    datum = datumReader.read(datum, decoder)
  }
  datum
}
)

deserialisedAvroRDD.count() ---> 3

I am deserializing the binary Avro messages to generate GenericRecords, and I was expecting the deserialized RDD to have 15k records since each .avro file had 5k records; however, after deserializing I only get 3 records. Can someone please help me find the issue with my code? How can I deserialize one record at a time?

Explorer
  • Does this answer your question? [Reading Avro File in Spark](https://stackoverflow.com/questions/45360359/reading-avro-file-in-spark) – moon May 12 '20 at 23:38
  • These are binary Avros, i.e. Array[Byte] – Explorer May 12 '20 at 23:46
  • The problem is probably in `byteRDD` reading. It doesn't know when a record starts and stops. Is there a reason to split the ops in 2 different steps? Why not use `binaryDecoder` to read `pds`? – moon May 13 '20 at 04:35
  • Why not use binaryDecoder to read pds? -> Can you point to an example for this? – Explorer May 13 '20 at 11:26

1 Answer


This should work. Your map keeps only the last `datum` read from each file, so you end up with one record per file (3 in total). Using `flatMap` and collecting every decoded record into a buffer returns them all:

import scala.collection.mutable.ArrayBuffer
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.rdd.RDD

val recRDD: RDD[GenericRecord] = sc.binaryFiles(s"$s3URL/*.avro").flatMap {
  case (file, pds) => {
    val schema = new Schema.Parser().parse(schemaJson)
    val datumReader = new GenericDatumReader[GenericRecord](schema)

    // Read the whole file into memory and decode it record by record
    val decoder = DecoderFactory.get.binaryDecoder(pds.toArray(), null)
    var datum: GenericRecord = null
    val out = ArrayBuffer[GenericRecord]()
    while (!decoder.isEnd()) {
      out += datumReader.read(datum, decoder)
    }
    out
  }
}
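
As a quick sanity check (assuming the three files of 5,000 records each described in the question), the flattened RDD should now contain every record:

recRDD.count()                        // expected: 15000 (3 files x 5000 records)
recRDD.first().getSchema.getFullName  // schema of the decoded GenericRecords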
moon