Requirement
We are consuming messages from Kafka using PySpark. These JSON messages contain some keys whose values are 0 or 1.
The requirement is to convert these 0s and 1s to False and True while writing the data to Delta Lake on S3.
Issue
By looking at a message alone, there is no way to identify which columns need this conversion. We can only tell from the schema for the topic: if the data type of a column is Boolean, then (and only then) its 0s and 1s need to be converted to False and True.
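For reference, if the topic's schema is available as a PySpark StructType, the affected columns can be picked out like this (a small sketch; it only looks at top-level fields):

```python
from pyspark.sql.types import BooleanType

def boolean_field_names(schema):
    # Names of all top-level fields declared Boolean in the topic schema.
    return [f.name for f in schema.fields
            if isinstance(f.dataType, BooleanType)]
```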
If I apply the schema as-is to the message, those column values become NULL: the columns contain 0s and 1s, and when BooleanType is applied to them, the values come out as NULL.
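A minimal reproduction of what I am seeing (the payload and field names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import BooleanType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The topic schema says "is_active" is Boolean, but the producer sends 0/1.
schema = StructType([
    StructField("id", StringType()),
    StructField("is_active", BooleanType()),
])

df = spark.createDataFrame([('{"id": "a1", "is_active": 1}',)], ["value"])
df.select(from_json(col("value"), schema).alias("msg")).select("msg.*").show()
# "is_active" comes back NULL instead of true, because the JSON token is the
# number 1, not a JSON boolean.
```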
How can I avoid this issue?
What I have tried
I tried looking into UserDefinedType in PySpark, but I couldn't find many helpful links, hence I am posting here. I also tried subclassing pyspark.sql.types.BooleanType, but that didn't work.
There is probably an easier way to do this that I am unable to think of right now.
I need to handle these values just before applying the schema; otherwise, the values in those columns end up as NULL.
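One direction I was considering, shown as a rough, untested sketch below: parse with a "relaxed" copy of the schema in which every BooleanType field is read as a string, then cast those columns back, relying on Spark's string-to-boolean cast accepting "0" and "1". Here registry_schema and kafka_df are placeholders for the topic schema and the Kafka stream, boolean_field_names is the helper from above, and nested structs are not handled:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import BooleanType, StringType, StructField, StructType

def relax_booleans(schema):
    # Copy of the schema with top-level Boolean fields read as strings,
    # so the 0/1 tokens survive JSON parsing instead of becoming NULL.
    return StructType([
        StructField(f.name, StringType(), f.nullable)
        if isinstance(f.dataType, BooleanType) else f
        for f in schema.fields
    ])

relaxed = relax_booleans(registry_schema)
parsed = (kafka_df
          .select(from_json(col("value").cast("string"), relaxed).alias("msg"))
          .select("msg.*"))

# Cast the relaxed columns back; Spark maps "1"/"0" (and "true"/"false")
# to true/false when casting string to boolean.
for name in boolean_field_names(registry_schema):
    parsed = parsed.withColumn(name, col(name).cast(BooleanType()))
```

I believe reading those fields as strings should also tolerate producers that already send real JSON booleans, since both 1 and true would arrive as castable strings, but I haven't verified that end to end.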
Is there a cleaner or more idiomatic way to do this?