1

I am new to Scala and Spark, so I am struggling with a map function I am trying to write. The map function on the DataFrame operates on a Row (org.apache.spark.sql.Row). I have been loosely following this article.

import org.apache.spark.sql.Row
import scala.util.{Failure, Success, Try}

val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
    // ??? is the gap: from_avro wants a Column, but inside map I only have a Row
    val parsed = Try(from_avro(???, currentValueSchema.value, fromAvroOptions)) match {
        case Success(parsedValue) => List(parsedValue, null)
        case Failure(ex) => List(null, ex.toString)
    }
    Row.fromSeq(row.toSeq.toList ++ parsed)
}

The from_avro function wants to accept a Column (org.apache.spark.sql.Column), but I don't see a way in the docs to get a Column from a Row.

I am fully open to the idea that I may be doing this whole thing wrong. Ultimately my goal is to parse the bytes coming in from a Structured Stream. Parsed records get written to Delta Table A and the failed records to another Delta Table B.
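Very roughly, the shape I am aiming for is something like the sketch below (parsedWithErrorsDF, the parseError column, and all paths are placeholders, not working code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// placeholder: assumes a parseError column that is null when parsing succeeded
val splitBatches: (DataFrame, Long) => Unit = (batch, _) => {
    batch.filter(col("parseError").isNull)
         .write.format("delta").mode("append").save("/delta/table_a")      // placeholder path
    batch.filter(col("parseError").isNotNull)
         .write.format("delta").mode("append").save("/delta/table_b")      // placeholder path
}

val query = parsedWithErrorsDF.writeStream
    .foreachBatch(splitBatches)
    .option("checkpointLocation", "/delta/_checkpoints/avro_parse")        // placeholder path
    .start()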

For context the source table looks as follows:

[screenshot of the source table schema]

Edit - from_avro returning null on "bad record"

There have been a few comments saying that from_avro returns null if it fails to parse a "bad record". By default from_avro uses mode FAILFAST, which throws an exception if parsing fails. If the mode is set to PERMISSIVE, an object in the shape of the schema is returned, but with all of its properties set to null (also not particularly useful...). Link: Apache Avro Data Source Guide - Spark 3.1.1 Documentation.
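For reference, switching the mode looks roughly like this (a sketch only; I am assuming the open-source org.apache.spark.sql.avro.functions.from_avro, and my real fromAvroOptions is built elsewhere):

import org.apache.spark.sql.avro.functions.from_avro
import scala.collection.JavaConverters._

// from_avro takes its options as a java.util.Map; the default mode is FAILFAST
val fromAvroOptions = Map("mode" -> "PERMISSIVE").asJava

val permissiveDf = filterValueDF.select(
    from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as("parsedValue"))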

Here is my original command:

val parsedDf = filterValueDF.select($"topic", 
                                    $"partition", 
                                    $"offset", 
                                    $"timestamp", 
                                    $"timestampType", 
                                    $"valueSchemaId", 
                                    from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as('parsedValue))

If there are ANY bad rows, the job is aborted with org.apache.spark.SparkException: Job aborted.

A snippet of the log of the exception:

Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
    at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:111)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:732)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$2(FileFormatWriter.scala:291)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1615)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
    ... 10 more
    Suppressed: java.lang.NullPointerException
        at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.write(NativeAzureFileSystem.java:1099)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
        at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
        at shaded.parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:107)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:482)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:489)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:252)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:234)
        at org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:74)
        at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1184)
        at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1051)
        at org.apache.parquet.format.FileMetaData.write(FileMetaData.java:949)
        at org.apache.parquet.format.Util.write(Util.java:222)
        at org.apache.parquet.format.Util.writeFileMetaData(Util.java:69)
        at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:757)
        at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:750)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:135)
        at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:58)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.abort(FileFormatDataWriter.scala:84)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$3(FileFormatWriter.scala:297)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
        ... 11 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
    at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
    at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
    at org.apache.spark.sql.avro.AvroDeserializer.createDecimal(AvroDeserializer.scala:308)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16(AvroDeserializer.scala:177)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16$adapted(AvroDeserializer.scala:174)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:336)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:332)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:354)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:351)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:75)
    at org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:89)
    at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
    ... 16 more

Oliver
  • Not sure I understand exactly your use case, but I would try to stay in the Dataframe (not converting it to RDD) and just apply the `from_avro` method based on the column `fixedValue` and a given Schema. If the parsing does not work, the from_avro function should return a null value. That means, you can then just filter your Dataframe based on this null value and write them into Delta Table B whereas you send the other part of the filter outcome to Delta Table A. – Michael Heil Apr 09 '21 at 14:16
  • @mike Your suggestion is what I am currently doing. However, if `from_avro` encounters a row it cannot parse it does NOT return null, it fails the whole streaming job. – Oliver Apr 13 '21 at 09:19
  • See updated answer @mike – Oliver Apr 13 '21 at 10:02
  • I see the behavior you are quoting is for mode PERMISSIVE, which is not the default behavior: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro – Oliver Apr 13 '21 at 10:08

2 Answers

1

In order to get a specific column from the Row object you can use either row.get(i) or, with the column name, row.getAs[T]("columnName"). Here you can check the details of the Row class.

Then your code would look like this:

val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
    val binaryFixedValue = row.getSeq[Byte](6) // or row.getAs[Seq[Byte]]("fixedValue")
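    // note: this still won't compile -- from_avro expects a Column, not a Seq[Byte] (see explanation below)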
    val parsed = Try(from_avro(binaryFixedValue, currentValueSchema.value, fromAvroOptions)) match {
        case Success(parsedValue) => List(parsedValue, null)
        case Failure(ex) => List(null, ex.toString)
    }
    Row.fromSeq(row.toSeq.toList ++ parsed)
}

Although in your case you don't really need to go into the map function at all: inside map you work with plain Scala values, while from_avro works with the Dataframe API. This is why you can't call from_avro directly from map: instances of the Column class can be used only in combination with the Dataframe API, e.g. df.select($"c1"), where c1 is an instance of Column. In order to use from_avro as you initially intended, just write:

filterValueDF.select(from_avro($"fixedValue", currentValueSchema))

As @mike already mentioned, if from_avro fails to parse the Avro content it will return null. Finally, if you want to separate the succeeded rows from the failed ones, you could do something like:

val includingFailuresDf = filterValueDF.select(
              from_avro($"fixedValue", currentValueSchema) as "avro_res")
             .withColumn("failed", $"avro_res".isNull)

val successDf = includingFailuresDf.where($"failed" === false)
val failedDf = includingFailuresDf.where($"failed" === true) 

Please be aware that the code was not tested.
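Update: as noted in the comments, with mode PERMISSIVE the parsed column comes back as a struct whose fields are all null rather than as a null column, so the isNull check above always evaluates to false. An (equally untested) workaround sketch is to test a field that can never legitimately be null; someRequiredField below is only a placeholder for such a field in your schema:

// someRequiredField is a placeholder for any field that is mandatory in the Avro schema
val includingFailuresDf = filterValueDF.select(
        from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions) as "avro_res")
    .withColumn("failed", $"avro_res.someRequiredField".isNull)

val successDf = includingFailuresDf.where($"failed" === false)
val failedDf = includingFailuresDf.where($"failed" === true)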

abiratsis
  • I see the behavior you are quoting is for mode PERMISSIVE, which is not the default behavior: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro – Oliver Apr 13 '21 at 10:08
  • I know it is untested but it's quite close!! However, the .isNull is not quite right. What you end up with is a struct of all the properties, but they are all null, so $"failed" is always `false`. It's almost looking like I should create a Java/Scala function to extend `from_avro` to be a little more workable. – Oliver Apr 13 '21 at 10:36
  • @Oliver you are right, my apologies for missing that. It has been some months since I used this particular function and I have already forgotten some details. I will update the answer accordingly. – abiratsis Apr 13 '21 at 11:00
  • @Oliver you are right about filtering out failures as well. It seems you will need some low-level validation of the Avro. Maybe you can use some existing tools from the Spark Avro library, where from_avro and to_avro live. – abiratsis Apr 13 '21 at 11:30
0

From what I understand, you just need to fetch a column value from a Row. You can probably do that by getting the value at a specific index using row.get().
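For example (untested sketch; "fixedValue" and the index are placeholders based on the question):

val fixedValueBytes = row.getAs[Array[Byte]]("fixedValue") // by column name
val sameBytes = row.get(6)                                 // by position; returns Any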

Shridhar