
I'm testing the new from_avro and to_avro functions in Spark 2.4.0.

I create a DataFrame with a single column and three rows, serialize it to Avro, and deserialize it back.

If the input dataset is created as

val input1 = Seq("foo", "bar", "baz").toDF("key")

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

the deserialization returns three copies of the last row:

+---+
|key|
+---+
|baz|
|baz|
|baz|
+---+

If I create the input dataset as

val input2 = input1.sqlContext.createDataFrame(input1.rdd, input1.schema)

the results are correct:

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

Example code:

import org.apache.spark.sql.avro.{SchemaConverters, from_avro, to_avro}
import org.apache.spark.sql.DataFrame
import spark.implicits._  // for toDF and the $-notation (spark is the SparkSession; implicit in spark-shell)

val input1 = Seq("foo", "bar", "baz").toDF("key")
val input2 = input1.sqlContext.createDataFrame(input1.rdd, input1.schema)

def test_avro(df: DataFrame): Unit = {
  println("input df:")
  df.printSchema()
  df.show()

  val keySchema = SchemaConverters.toAvroType(df.schema).toString
  println(s"avro schema: $keySchema")

  val avroDf = df
    .select(to_avro($"key") as "key")

  println("avro serialized:")
  avroDf.printSchema()
  avroDf.show()

  val output = avroDf
    .select(from_avro($"key", keySchema) as "key")
    .select("key.*")

  println("avro deserialized:")
  output.printSchema()
  output.show()
}

println("############### testing .toDF()")
test_avro(input1)
println("############### testing .createDataFrame()")
test_avro(input2)

Result:

############### testing .toDF()
input df:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

avro schema: {"type":"record","name":"topLevelRecord","fields":[{"name":"key","type":["string","null"]}]}
avro serialized:
root
 |-- key: binary (nullable = true)

+----------------+
|             key|
+----------------+
|[00 06 66 6F 6F]|
|[00 06 62 61 72]|
|[00 06 62 61 7A]|
+----------------+
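As a side note, the hex dumps above are the standard Avro binary encoding of a ["string","null"] union: one zigzag-varint for the union branch index, one zigzag-varint for the string length, then the UTF-8 bytes. A minimal hand-decoding sketch (single-byte varints are assumed here, which holds for these small values):

```scala
// Decode [00 06 66 6F 6F] by hand: union branch + zigzag length + UTF-8 payload.
def zigzagDecode(n: Int): Int = (n >>> 1) ^ -(n & 1)

val bytes = Array(0x00, 0x06, 0x66, 0x6F, 0x6F).map(_.toByte)
val unionIndex = zigzagDecode(bytes(0) & 0xFF) // 0 -> first union branch: "string"
val length     = zigzagDecode(bytes(1) & 0xFF) // 0x06 zigzag-decodes to length 3
val payload    = new String(bytes.slice(2, 2 + length), "UTF-8")
println(s"branch=$unionIndex len=$length value=$payload") // branch=0 len=3 value=foo
```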

avro deserialized:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|baz|
|baz|
|baz|
+---+

############### testing .createDataFrame()
input df:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

avro schema: {"type":"record","name":"topLevelRecord","fields":[{"name":"key","type":["string","null"]}]}
avro serialized:
root
 |-- key: binary (nullable = true)

+----------------+
|             key|
+----------------+
|[00 06 66 6F 6F]|
|[00 06 62 61 72]|
|[00 06 62 61 7A]|
+----------------+

avro deserialized:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

The test suggests the problem is in the deserialization phase, since printing the Avro-serialized DataFrame shows three distinct binary rows in both cases.
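A plausible mechanism for this symptom (an assumption on my part, not confirmed against Spark's source) is a deserializer that reuses a single mutable row buffer across records: if the collected rows hold references to that shared buffer, every row ends up showing the last value written. A pure-Scala sketch of that pattern:

```scala
// Buggy pattern: every iterator element is the SAME mutable array,
// so collecting by reference yields N copies of the last value.
val buf = Array("")
val reused = Seq("foo", "bar", "baz").iterator.map { s => buf(0) = s; buf }
val collected = reused.toList
println(collected.map(_(0))) // List(baz, baz, baz)

// Correct pattern: allocate (or copy) a fresh buffer per record.
val safe = Seq("foo", "bar", "baz").iterator.map { s => Array(s) }.toList
println(safe.map(_(0))) // List(foo, bar, baz)
```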

Am I doing something wrong, or is this a bug?

1 Answer

Seems like it is a bug. I filed a bug report and it's now fixed in the 2.3 and 2.4 branches.