
I'm facing an issue that I hope I can explain.

I'm trying to parse a CSV file with PySpark. This CSV file has some JSON columns. The JSON columns share the same schema, but they are not filled in the same way.

For instance, I have:

{"targetUrl":"https://snowplowanalytics.com/products/snowplow-insights", "elementId":NULL, "elementClasses":NULL,"elementTarget":NULL}

or

{"targetUrl":"https://snowplowanalytics.com/request-demo/", "elementId":"button-request-demo-header-page", "elementClasses":["btn","btn-primary","call-to-action"]}

At the moment, when I do:

import pyspark.sql.functions as fn
from pyspark.sql import types as st

simpleSchema = st.StructType([
    st.StructField("targetUrl", st.StringType(), True),
    st.StructField("elementId", st.StringType(), True),
    st.StructField("elementClasses", st.StringType(), True)
])

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("quoteAll", "true") \
    .option("escape", "\"") \
    .load("./Sources/explore_snowplow_data_raw.csv")
df.select(fn.from_json(fn.col("link_click_event"), simpleSchema).alias("linkJson")) \
    .select("linkJson.*").show(50)

(link_click_event is my JSON column name)

Only my second JSON row is fully returned, because none of its values are null.

My problem is that the first row is returned as

+--------------------+--------------------+--------------------+
|           targetUrl|           elementId|      elementClasses|
+--------------------+--------------------+--------------------+
|                null|                null|                null|

How can I get a result like the following for my first row?

+--------------------+--------------------+--------------------+
|           targetUrl|           elementId|      elementClasses|
+--------------------+--------------------+--------------------+
|"https://snowplo"...|                null|                null|

Many thanks

2 Answers


Since your JSON is not stringified (though in your case I think that is fine), it could not be read correctly for a test case, so I made a stringified version:

col1
"{\"targetUrl\":\"https://snowplowanalytics.com/products/snowplow-insights\",\"elementId\":null,\"elementClasses\":null,\"elementTarget\":null}"
"{\"targetUrl\":\"https://snowplowanalytics.com/request-demo/\", \"elementId\":\"button-request-demo-header-page\", \"elementClasses\":[\"btn\",\"btn-primary\",\"call-to-action\"]}"
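Note that this sample uses lowercase null: the uppercase NULL in the question's first row is not valid JSON, which is why from_json returns a null struct for that whole row in the default PERMISSIVE mode. The standard json module rejects it the same way:

```python
import json

valid = '{"targetUrl": "https://snowplowanalytics.com/products/snowplow-insights", "elementId": null}'
invalid = '{"targetUrl": "https://snowplowanalytics.com/products/snowplow-insights", "elementId": NULL}'

parsed = json.loads(valid)          # lowercase null parses to None
try:
    json.loads(invalid)
    bad_json_rejected = False
except json.JSONDecodeError:
    bad_json_rejected = True        # uppercase NULL is rejected as malformed JSON
```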

After that, with this code,

import pyspark.sql.functions as f
from pyspark.sql import types as st

simpleSchema = st.StructType([
    st.StructField("targetUrl", st.StringType(), True),
    st.StructField("elementId", st.StringType(), True),
    st.StructField("elementClasses", st.ArrayType(st.StringType()), True),
    st.StructField("elementTarget", st.StringType(), True)
])

# df is the DataFrame read from the stringified CSV above
df.withColumn('col1', f.from_json('col1', simpleSchema)).show(10, False)

+-------------------------------------------------------------------------------------------------------------------+
|col1                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------+
|[https://snowplowanalytics.com/products/snowplow-insights,,,]                                                      |
|[https://snowplowanalytics.com/request-demo/, button-request-demo-header-page, [btn, btn-primary, call-to-action],]|
+-------------------------------------------------------------------------------------------------------------------+

it works fine.

Lamanus

Setting the primitivesAsString parameter to True worked for me.

primitivesAsString – infers all primitive values as a string type. If None is set, it uses the default value, false.

Elias