I am trying to read Avro files from S3, and as shown in the Spark documentation I am able to read them fine. My files look like the ones below; each file contains 5,000 records.
s3a://bucket/part-0.avro
s3a://bucket/part-1.avro
s3a://bucket/part-2.avro
import org.apache.spark.rdd.RDD

val byteRDD: RDD[Array[Byte]] = sc.binaryFiles(s"$s3URL/*.avro").map { case (file, pds) =>
  val dis = pds.open()          // open the stream once and reuse it
  val len = dis.available()     // assumes this reports the full file length
  val buf = Array.ofDim[Byte](len)
  dis.readFully(buf)            // read from the already-open stream
  dis.close()
  buf
}
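For reference, PortableDataStream also exposes a toArray() method that reads the whole file into a byte array in one call, which might be a simpler way to do the same step. A minimal sketch (byteRDD2 is just an illustrative name):

import org.apache.spark.rdd.RDD

// Sketch: same byte-reading step via PortableDataStream.toArray(),
// which reads the entire file into a byte array in one call.
val byteRDD2: RDD[Array[Byte]] =
  sc.binaryFiles(s"$s3URL/*.avro").map { case (_, pds) => pds.toArray() }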
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

val deserialisedAvroRDD = byteRDD.map(record => {
  val schema = new Schema.Parser().parse(schemaJson)
  val datumReader = new GenericDatumReader[GenericRecord](schema)
  val decoder = DecoderFactory.get.binaryDecoder(record, null)
  var datum: GenericRecord = null
  while (!decoder.isEnd()) {
    datum = datumReader.read(datum, decoder)  // read the next record, reusing datum
  }
  datum
})
deserialisedAvroRDD.count() ---> 3
I am deserializing the binary Avro messages into GenericRecords, and I was expecting the deserialized RDD to have 15k records since each .avro file holds 5k records; however, after deserializing I only get 3 records. Can someone please help me find the issue with my code? How can I deserialize one record at a time?
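For context, this is the direction I am aiming for: collecting every datum from the decoder loop instead of keeping only the last one. A minimal sketch, assuming the byte arrays hold raw binary-encoded datums (not Avro container files with a header); allRecordsRDD is just an illustrative name:

import scala.collection.mutable.ArrayBuffer
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Sketch: collect every record per file via flatMap instead of
// returning a single datum per byte array.
val allRecordsRDD = byteRDD.flatMap { bytes =>
  val schema = new Schema.Parser().parse(schemaJson)
  val datumReader = new GenericDatumReader[GenericRecord](schema)
  val decoder = DecoderFactory.get.binaryDecoder(bytes, null)
  val records = ArrayBuffer[GenericRecord]()
  while (!decoder.isEnd()) {
    records += datumReader.read(null, decoder)  // pass null so each record is a fresh object
  }
  records
}

Even with this, I am not sure whether binaryDecoder is the right entry point if the files were written as Avro container files, so any pointers are appreciated.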