
I am trying to save data from a Spark dataframe to HDFS using an Avro schema stored in the Schema Registry. However, I get an error while writing the data:

Caused by: org.apache.avro.AvroRuntimeException: Not a union: {"type":"long","logicalType":"timestamp-millis"}
    at org.apache.avro.Schema.getTypes(Schema.java:299)
    at org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
    at org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209)
    at org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:208)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.immutable.List.map(List.scala:296)
    at org.apache.spark.sql.avro.AvroSerializer.newStructConverter(AvroSerializer.scala:208)
    at org.apache.spark.sql.avro.AvroSerializer.<init>(AvroSerializer.scala:51)
    at org.apache.spark.sql.avro.AvroOutputWriter.serializer$lzycompute(AvroOutputWriter.scala:42)
    at org.apache.spark.sql.avro.AvroOutputWriter.serializer(AvroOutputWriter.scala:42)
    at org.apache.spark.sql.avro.AvroOutputWriter.write(AvroOutputWriter.scala:64)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)

What could be the reason for it?

The field in the Avro schema looks like this:

{"name":"CreateDate","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null}

Here is an example of the date format:

1900-01-01 00:00:00

The data type of this field in the Spark dataframe:

|-- CreateDate: timestamp (nullable = true)

This is the way I write the data:

dataDF.write
  .mode("append")
  .format("avro")
  .option(
    "avroSchema",
    SchemaRegistry.getSchema(
      schemaRegistryConfig.url,
      schemaRegistryConfig.dataSchemaSubject,
      schemaRegistryConfig.dataSchemaVersion))
  .save(hdfsURL)
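One thing I tried to rule out (a sketch only, not a confirmed fix): aligning the dataframe's column order with the Avro schema's field order before writing, in case the serializer matches columns to schema fields by position. `alignedDF` and the use of `org.apache.avro.Schema.Parser` are my additions here; `SchemaRegistry.getSchema` is the same helper as above and is assumed to return the schema as a JSON string.

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Assumption: getSchema returns the Avro schema JSON as a String
val avroSchemaJson = SchemaRegistry.getSchema(
  schemaRegistryConfig.url,
  schemaRegistryConfig.dataSchemaSubject,
  schemaRegistryConfig.dataSchemaVersion)

// Parse the schema and read its field order
val fieldOrder = new Schema.Parser().parse(avroSchemaJson)
  .getFields.asScala.map(_.name)

// Select the dataframe columns in the Avro schema's order before writing
val alignedDF = dataDF.select(fieldOrder.map(dataDF.col): _*)

alignedDF.write
  .mode("append")
  .format("avro")
  .option("avroSchema", avroSchemaJson)
  .save(hdfsURL)
```

This did not change the error for me, so the column order does not appear to be the problem.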
  • The problem seems to be that `CreateDate` in your code is not a union type but a primitive `long`, which Spark converts to a non-union, non-nullable `timestamp-millis` logical type (you can read about this in https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion) – Yuval Itzchakov Jun 19 '19 at 14:38
  • In order to turn the column into a nullable one, see https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe – Yuval Itzchakov Jun 19 '19 at 14:43
  • @YuvalItzchakov I am not sure, since in the Spark dataframe it is a timestamp type – Cassie Jun 19 '19 at 15:03
  • Oh, I missed the fact that the timestamp column is already set to nullable in your DF schema. Hmmm.. – Yuval Itzchakov Jun 19 '19 at 15:07
  • Can you try and debug the `AvroSerializer` class and see how it's treating that column? Also, do you have any other `TimestampType` fields in your dataDF? – Yuval Itzchakov Jun 19 '19 at 15:28
  • Yes, I have about five columns of `TimestampType` with the same format. How can I debug `AvroSerializer`? – Cassie Jun 20 '19 at 06:02
  • Can you show the entire schema for `dataDF` and the full avro schema you get back from the registry? – Yuval Itzchakov Jun 20 '19 at 07:08

0 Answers