
I am having a problem deserializing data from an .avro file. My process consists of these steps:

  1. Reading from Kafka
df = (
    spark.read.format("kafka")
    .option("kafka.security.protocol", "PLAINTEXT")
    .option("kafka.sasl.mechanism", "GSSAPI")
    .option("kafka.bootstrap.servers", KAFKA_BROKERS)
    .option("subscribe", KAFKA_TOPIC)
    .option("group.id", "noid")
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")
    .option("maxOffsetsPerTrigger", 2)
    .load()
    .selectExpr("CAST(value AS binary)")
)
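
As far as I understand, the value column coming out of the Kafka source is already the raw message bytes, so the CAST should be a no-op; a quick sanity check on this DataFrame:

# after the selectExpr the DataFrame should hold a single binary
# column named "value" containing the raw Kafka message bytes
df.printSchema()
# root
#  |-- value: binary (nullable = true)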

  2. Writing to HDFS
df = df.select("value")
df.write.format("avro").save("path/to/avro_files") 
  3. Loading the .avro file
schema = """{
           "type":"record",
           "name":"GPRSIOT",
           "fields":[
              {
                 "name":"field_1",
                 "type":"string",
                 "default":"null"
              }
...
              ]
}


dff = (
    spark.read.format("avro")
    .option("avroSchema", schema)
    .load("path/to/avro_files/part-00000-ff1186a6-dce4-492e-87b8-8bb4332f4bc1-c000.avro")
)

The problem is that I only get nulls (and zeros) as values:

+-------+-------+-------+
|field_1|field_2|field_3|
+-------+-------+-------+
|null   |0      |null   |
+-------+-------+-------+

A small part of my .avro file:

  Ɖ�v|y
f�30387f��������������������������������������Z�19807F� ؖ��1.2�A0Z��"�b
                                                                       0FDDZv�
169r&Z��~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~Z~32210F�+��ߘv~y�1.1Z}*�L2FF3E�N�
�~841576��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+05321F�

Any ideas on how to deserialize the Avro file so that I get the real values instead of nulls?
