Requirement
We are consuming messages from Kafka using PySpark. These JSON messages contain some keys whose values are 0 or 1.
The requirement is to convert these 0s and 1s to False and True while writing the data to Delta Lake on S3.
Issue
By looking at a message alone, there is no way to identify which columns need this conversion. We can only tell from the schema for the topic: if the data type of a column is Boolean, then (and only then) its 0s and 1s need to be converted to False and True.
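For reference, if the topic's schema is available as a PySpark StructType, the affected columns can be picked out like this (a small sketch; it only looks at top-level fields):

```python
from pyspark.sql.types import BooleanType

def boolean_field_names(schema):
    # Names of all top-level fields declared Boolean in the topic schema.
    return [f.name for f in schema.fields
            if isinstance(f.dataType, BooleanType)]
```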
If I apply the schema as-is to the message, those column values become NULL: the columns contain 0s and 1s, and when BooleanType is applied to them, the values come out as NULL.
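A minimal reproduction of what I am seeing (the payload and field names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import BooleanType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The topic schema says "is_active" is Boolean, but the producer sends 0/1.
schema = StructType([
    StructField("id", StringType()),
    StructField("is_active", BooleanType()),
])

df = spark.createDataFrame([('{"id": "a1", "is_active": 1}',)], ["value"])
df.select(from_json(col("value"), schema).alias("msg")).select("msg.*").show()
# "is_active" comes back NULL instead of true, because the JSON token is the
# number 1, not a JSON boolean.
```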
How can I avoid this issue?
What I have tried
I tried looking into UserDefinedType in PySpark, but I couldn't find many helpful links, hence I am posting here. I also tried subclassing pyspark.sql.types.BooleanType, but that didn't work.
There is probably an easier way to do this that I am unable to think of right now.
I need to handle these values just before applying the schema; otherwise, the values in those columns end up as NULL.
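One direction I was considering, shown as a rough, untested sketch below: parse with a "relaxed" copy of the schema in which every BooleanType field is read as a string, then cast those columns back, relying on Spark's string-to-boolean cast accepting "0" and "1". Here registry_schema and kafka_df are placeholders for the topic schema and the Kafka stream, boolean_field_names is the helper from above, and nested structs are not handled:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import BooleanType, StringType, StructField, StructType

def relax_booleans(schema):
    # Copy of the schema with top-level Boolean fields read as strings,
    # so the 0/1 tokens survive JSON parsing instead of becoming NULL.
    return StructType([
        StructField(f.name, StringType(), f.nullable)
        if isinstance(f.dataType, BooleanType) else f
        for f in schema.fields
    ])

relaxed = relax_booleans(registry_schema)
parsed = (kafka_df
          .select(from_json(col("value").cast("string"), relaxed).alias("msg"))
          .select("msg.*"))

# Cast the relaxed columns back; Spark maps "1"/"0" (and "true"/"false")
# to true/false when casting string to boolean.
for name in boolean_field_names(registry_schema):
    parsed = parsed.withColumn(name, col(name).cast(BooleanType()))
```

I believe reading those fields as strings should also tolerate producers that already send real JSON booleans, since both 1 and true would arrive as castable strings, but I haven't verified that end to end.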
Is there a cleaner or more idiomatic way to do this?