
Following is a snippet from our PySpark application code:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType(
    [
        StructField('name', StringType(), True),
        StructField('version', StringType(), True),
        StructField('requestBody', StringType(), True),
        StructField('id', StringType(), True),
    ]
)

from pyspark.sql.functions import col, from_json

df_new = df.withColumn('value', from_json('value', schema)) \
    .where(col('value.version') == '1') \
    .select(col('value.*')) \
    .na.drop() \
    .withColumn('requestBody', decrypt_udf(col('requestBody')))

df_new.show(truncate=False)

+-------+--------+---------------------------------------------+---+
|   name| version|                                  requestBody| id|
+-------+--------+---------------------------------------------+---+
|kj-test|       1|{"data": {"score": 130, "group": "silver"}}  |  1|
|kj-test|       1|{"data": {"score": 250, "group": "gold"}}    |  2|
|kj-test|       1|{"data": {"score": 330, "group": "platinum"}}|  3|
+-------+--------+---------------------------------------------+---+

The `decrypt_udf` UDF snippet:

from pyspark.sql.functions import udf

@udf(returnType=StringType())
def decrypt_udf(encrypted_string: str):
    ...
    ...
    return decrypted_json_str

When I write the Spark DataFrame to the S3 bucket as follows:

df_new.write.mode('overwrite').json(path=s3outputpath)

The resulting file has content as follows. Here the value of requestBody is written as a string, hence the double quotes around it and the escaping of the inner double quotes.

{"name":"kj-test","version":"1","requestBody":"{\"data\": {\"score\": 130, \"group\": \"silver\"}}","id":"1"}
{"name":"kj-test","version":"2","requestBody":"{\"data\": {\"score\": 250, \"group\": \"gold\"}}","id":"1"}
{"name":"kj-test","version":"3","requestBody":"{\"data\": {\"score\": 330, \"group\": \"platinum\"}}","id":"1"}

However, I am expecting the value of requestBody to be written as a JSON object, as below.

{"name":"kj-test","version":"1","requestBody":{"data": {"score": 130, "group": "silver"}},"id":"1"}

I understand that I have specified the type of requestBody as string in the schema (`StructField('requestBody', StringType(), True)`), and hence I see the output that way. How can I achieve the output that I am expecting? There is no such type as `JsonType`.


EDIT:

Please note that my requestBody schema will not always be like this: {"data": {"score": 130, "group": "silver"}}. For a given run it is fixed, but another run could have a totally different schema.

Essentially, I need a way to infer the schema from the JSON string. I found some SO posts which could be helpful and will try these out:

https://stackoverflow.com/a/45880574/948268
Spark from_json with dynamic schema
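
For instance, something along these lines using `schema_of_json` might work (a minimal, untested sketch; it assumes Spark 2.4+ and that all rows within a single run share one schema, and `df_out` is just a name I made up):

from pyspark.sql.functions import schema_of_json, from_json, col, lit

# take one sample requestBody and let Spark infer its schema as a DDL string
sample = df_new.select('requestBody').first()['requestBody']
ddl = df_new.select(schema_of_json(lit(sample)).alias('s')).first()['s']

# from_json also accepts a DDL-formatted schema string
df_out = df_new.withColumn('requestBody', from_json(col('requestBody'), ddl))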

Kuldeep Jain
  • Have you tried to change your UDF and return as `MapType` instead of `StringType`? – pltc Oct 25 '21 at 19:42
  • @pltc `MapType` does not cover all possibilities of json. Like I could not specify the corresponding type for value in the MapType. – Kuldeep Jain Oct 27 '21 at 03:52

3 Answers


Try the below code (I have not tested it).

Convert the requestBody JSON string to a struct using the `from_json` function.

schema = StructType(
    [
        StructField('name', StringType(), True),
        StructField('version', StringType(), True),
        StructField('requestBody', StringType(), True),
        StructField('id', StringType(), True),
    ]
)

Prepare a schema for requestBody:

requestSchema = StructType(
    [
        StructField('data', StructType([
            StructField('group', StringType(), True),
            StructField('score', LongType(), True),
        ])),
    ]
)
df_new = df.withColumn('value', from_json('value', schema)) \
    .where(col('value.version') == '1') \
    .select(col('value.*')) \
    .na.drop() \
    .withColumn('requestBody', from_json('requestBody', requestSchema))

df_new.write.mode('overwrite').json(path=s3outputpath)
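
With requestBody parsed into a struct, the JSON writer emits it as a nested object instead of an escaped string; you can verify the column type with printSchema:

df_new.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- version: string (nullable = true)
#  |-- requestBody: struct (nullable = true)
#  |    |-- data: struct (nullable = true)
#  |    |    |-- group: string (nullable = true)
#  |    |    |-- score: long (nullable = true)
#  |-- id: string (nullable = true)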
Srinivas
  • Thanks @Srinivas, my requestBody schema is not fixed. It could be any json. Is there a way to infer the schema based on the requestBody value? – Kuldeep Jain Oct 23 '21 at 18:38

In your UDF, return the result of `json.dumps`, which converts a Python object to a JSON string:

import json

@udf(returnType=StringType())
def decrypt_udf(encrypted_string: str):
    ...
    ...
    return json.dumps(decrypted_json_str)
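
Note that `json.dumps` only helps if the decrypted value is a Python object such as a dict; applied to a string that already contains JSON, it re-escapes it:

import json

json.dumps({"data": {"score": 130}})    # -> '{"data": {"score": 130}}'
json.dumps('{"data": {"score": 130}}')  # -> '"{\\"data\\": {\\"score\\": 130}}"'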
greenie
  • Thanks greenie. The issue is not in my udf; as you can see in the output of `df_new.show()`, it's showing correctly. In fact I already do `json.dumps` in my udf. But when it's written out, the type of the `requestBody` field is `StringType`, hence I have this issue. – Kuldeep Jain Oct 25 '21 at 18:48

NEWER SOLUTION (I think this is a better one)

Another clever solution, which we finally used. Here we define a UDF, get_combined_json, which combines all the columns of a given Row and returns a JSON string. This leaves the final dataframe with a single column, so we can write the dataframe as a text file; that way the entire JSON string is written as-is, without any escaping. Following is the code snippet:

from pyspark.sql.functions import col, from_json, struct

df_new = df.withColumn('value', from_json('value', schema)) \
    .where(col('value.version') == '1') \
    .select(col('value.*')) \
    .na.drop() \
    .withColumn('requestBody', decrypt_udf(col('requestBody')))

df_new.withColumn('combinedColumns', get_combined_json(struct([df_new[x] for x in df_new.columns]))) \
    .select(col('combinedColumns')) \
    .write.mode('overwrite').text(path=output_s3_bucket_path)

...

import json
from pyspark.sql import Row

@udf(returnType=StringType())
def get_combined_json(row: Row):
    # json.loads re-parses requestBody so it is embedded as a nested
    # object instead of an escaped string
    return json.dumps({"requestBody": json.loads(row.requestBody),
                       "name": row.name,
                       "version": row.version,
                       "id": row.id})




OLDER SOLUTION

Here is how we derived/inferred the schema from the requestBody JSON string:

request_body_schema = spark_session.read.json(df_new.rdd.map(lambda r: r.requestBody)).schema
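
For the sample rows above, the inferred schema would be equivalent to the following (spark.read.json sorts fields alphabetically and infers the numeric score as a long):

StructType([
    StructField('data', StructType([
        StructField('group', StringType(), True),
        StructField('score', LongType(), True),
    ]), True),
])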

We then used that schema to update the dataframe. This is the final code that worked:

df_new = df.withColumn('value', from_json('value', schema)) \
    .where(col('value.version') == '1') \
    .select(col('value.*')) \
    .na.drop() \
    .withColumn('requestBody', decrypt_udf(col('requestBody')))

request_body_schema = spark_session.read.json(df_new.rdd.map(lambda r: r.requestBody)).schema

df_new = df_new.withColumn('requestBody', from_json(col('requestBody'), request_body_schema))

df_new.write.mode('overwrite').json(path=output_s3_bucket_path)

And the following is the output format written to the S3 bucket:

{"name":"kj-test","version":"1","requestBody":{"data": {"score": 130, "group": "silver"}},"id":"1"}
Kuldeep Jain