I am trying to read an Avro file which is binary (Base64) encoded and Snappy compressed. Running Hadoop cat on the Avro file gives:
Objavro.schema?
{"type":"record","name":"ConnectDefault","namespace":"xyz.connect.avro","fields":
[{"name":"service","type":"string"},{"name":"timestamp","type":"long"},
{"name":"count","type":"int"},{"name":"encoderKey","type":{"type":"map","values":"string"}},
{"name":"schema","type":"string"},{"name":"data","type":"string"}]}>??n]
I need to extract and read the "schema" and "data" fields from the above file. The "schema" is associated with the "data", which has multiple fields.
I tried the steps below:
1. Reading the binary file
val binaryFilesRDD = sc.binaryFiles("file+0+00724+00731.avro").map { x => ( x._2.toArray) }
binaryFilesRDD: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[1] at map at <console>:24
2. Converting the RDD[Array[Byte]] into an Array[Byte]
scala> val newArray = binaryFilesRDD.collect().flatten
newArray: Array[Byte] = Array(17, 18, 16, 51, 24, 22, 17, 18, 117, 151, 76, 105, 95, 124....
3. Calling the following method with newArray (i.e. the Array[Byte]) to get records from the bytes
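import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Decode one record from raw Avro binary bytes using the given schema
def getGenericRecordfromByte(inputData: Array[Byte], inputDataSchema: Schema): GenericRecord = {
  val datareader = new GenericDatumReader[GenericRecord](inputDataSchema)
  val datadecoder = DecoderFactory.get.binaryDecoder(inputData, null)
  datareader.read(null, datadecoder)
}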
But I am getting the following error:
scala> val newDataRecords = getGenericRecordfromByte(newArray,inputDataSchema)
org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -40
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:363)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:355)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:157)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
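For reference, the cat output starts with "Obj" followed by an avro.schema entry, which suggests the file is an Avro object container file (header plus compressed blocks) rather than a bare binary-encoded record. A minimal sketch of reading such a file with Avro's DataFileStream instead of a raw binaryDecoder is shown here. It assumes the snappy-java library is on the classpath so the Snappy codec can be loaded, and the helper names readContainerFile and schemaAndData are hypothetical, made up for illustration:

```scala
import java.io.ByteArrayInputStream
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: reads every record of ONE Avro container file from its bytes.
// Assumption: the bytes are a complete container file (header included).
def readContainerFile(fileBytes: Array[Byte]): Seq[GenericRecord] = {
  // No schema is passed in: DataFileStream takes the writer schema from the file header
  val datumReader = new GenericDatumReader[GenericRecord]()
  val stream = new DataFileStream[GenericRecord](new ByteArrayInputStream(fileBytes), datumReader)
  try {
    val records = ArrayBuffer.empty[GenericRecord]
    while (stream.hasNext) records += stream.next()
    records.toList
  } finally {
    stream.close()
  }
}

// Hypothetical helper: pulls the "schema" and "data" fields out as plain strings.
// Avro returns string fields as Utf8 objects, hence the toString calls.
def schemaAndData(record: GenericRecord): (String, String) =
  (record.get("schema").toString, record.get("data").toString)
```

With Spark, this would be applied per file, for example inside a flatMap over sc.binaryFiles(...), converting each record to plain strings before collecting, since GenericRecord is not Serializable. Flattening several files into one Array[Byte] would not work with this approach, because each container file carries its own header.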
Please advise.