I'm having a problem deserializing data from an .avro file. My process consists of these steps:
- Reading from Kafka
df = (
    spark.read.format("kafka")
    .option("kafka.security.protocol", "PLAINTEXT")
    .option("kafka.sasl.mechanism", "GSSAPI")
    .option("kafka.bootstrap.servers", KAFKA_BROKERS)
    .option("subscribe", KAKFA_TOPIC)
    .option("group.id", "noid")
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")
    .option("maxOffsetsPerTrigger", 2)
    .load()
    .selectExpr(
        "CAST(value AS binary)"
    )
)
- Writing to HDFS
df = df.select("value")
df.write.format("avro").save("path/to/avro_files")
- Loading .avro file
schema = """{
    "type":"record",
    "name":"GPRSIOT",
    "fields":[
        {
            "name":"field_1",
            "type":"string",
            "default":"null"
        }
        ...
    ]
}"""
dff = spark.read.format("avro") \
    .option("avroSchema", schema) \
    .load("path/to/avro_files/part-00000-ff1186a6-dce4-492e-87b8-8bb4332f4bc1-c000.avro")
The problem is that I get nulls as values.
+-------+-------+-------+
|field_1|field_2|field_3|
+-------+-------+-------+
|null |0 |null |
+-------+-------+-------+
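One thing that may matter here: the files written in the previous step come from a DataFrame with a single binary `value` column, so the schema stored in each file is probably not the `GPRSIOT` record, and forcing it through `avroSchema` could explain the nulls. The sketch below shows the alternative of reading the file as written and decoding the `value` column with `from_avro`. It assumes Spark 3.x with the external `spark-avro` package available, and that each Kafka payload is plain Avro binary with no extra framing in front of it. `decode_payload` and `SCHEMA_JSON` are hypothetical names, and the schema is a placeholder that lists only the one field shown above.

```python
import json

# Placeholder schema: only field_1 is known from the question. The real
# schema must list every field, in the order the producer wrote them.
SCHEMA_JSON = json.dumps({
    "type": "record",
    "name": "GPRSIOT",
    "fields": [{"name": "field_1", "type": "string"}],
})


def decode_payload(spark, path, schema_json=SCHEMA_JSON):
    """Read the container files as written, then decode the binary column."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql.avro.functions import from_avro
    from pyspark.sql.functions import col

    # No avroSchema option: Spark uses the writer schema stored in the file,
    # which is expected to describe a single binary `value` field.
    raw = spark.read.format("avro").load(path)

    # Decode each Kafka payload against the record schema and flatten it.
    return (
        raw.select(from_avro(col("value"), schema_json).alias("rec"))
        .select("rec.*")
    )
```

Calling `decode_payload(spark, "path/to/avro_files")` would return one column per schema field. If the result is still null or the job fails, the payload may carry a prefix or use a different writer schema than the one supplied.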
A small part of my .avro file:
Ɖ�v|y
f�30387f��������������������������������������Z�19807F� ؖ��1.2�A0Z��"�b
0FDDZv�
169r&Z��~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~�~Z~32210F�+��ߘv~y�1.1Z}*�L2FF3E�N�
�~841576��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+��+05321F�
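Raw bytes like these are hard to judge by eye. An Avro object container file begins with the magic bytes `Obj\x01` followed by a metadata map that includes the writer schema under the key `avro.schema`, so it is possible to check which schema the file was actually written with. The following is a minimal sketch using only the Python standard library. It assumes a local copy of the part file, and the function names are hypothetical.

```python
import io
import json

MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag encoded)."""
    shift = 0
    result = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of Avro header")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # high bit clear marks the last byte
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Decode Avro bytes/string: a long length followed by that many bytes."""
    return stream.read(read_long(stream))


def read_writer_schema(stream):
    """Return the writer schema embedded in a container file header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the metadata map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return json.loads(meta["avro.schema"])
```

With a local copy of the file, `read_writer_schema(open("part-file.avro", "rb"))` (the file name here is a placeholder) returns the embedded schema as a Python object, which can then be compared with the schema passed to `avroSchema`.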
Any ideas on how to deserialize the Avro file so that I get the real values instead of nulls?