
I'm facing an issue that I hope I can explain.

I'm trying to parse a CSV file with PySpark. This CSV file has some JSON columns. The JSON columns share the same schema, but they are not filled in the same way.

For instance, I have:

{"targetUrl":"https://snowplowanalytics.com/products/snowplow-insights", "elementId":NULL, "elementClasses":NULL,"elementTarget":NULL}

or

{"targetUrl":"https://snowplowanalytics.com/request-demo/", "elementId":"button-request-demo-header-page", "elementClasses":["btn","btn-primary","call-to-action"]}

At the moment, when I do:

import pyspark.sql.functions as fn
from pyspark.sql import types as st

simpleSchema = st.StructType([
    st.StructField("targetUrl", st.StringType(), True),
    st.StructField("elementId", st.StringType(), True),
    st.StructField("elementClasses", st.StringType(), True)
])

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("quoteAll", "true") \
    .option("escape", "\"") \
    .load("./Sources/explore_snowplow_data_raw.csv")
df.select(fn.from_json(fn.col("link_click_event"), simpleSchema).alias("linkJson")) \
    .select("linkJson.*").show(50)

(link_click_event is my JSON column name)

Only my second JSON row is fully returned, because none of its values are null.

My problem is that the first row is returned as

+--------------------+--------------------+--------------------+
|           targetUrl|           elementId|      elementClasses|
+--------------------+--------------------+--------------------+
|                null|                null|                null|

How can I get a result like the following for my first row?

+--------------------+--------------------+--------------------+
|           targetUrl|           elementId|      elementClasses|
+--------------------+--------------------+--------------------+
|"https://snowplo"...|                null|                null|

Many thanks

2 Answers


Since your JSON is not stringified (though in your case I think that is fine), it could not be read correctly for a test case, so I made a stringified version:

col1
"{\"targetUrl\":\"https://snowplowanalytics.com/products/snowplow-insights\",\"elementId\":null,\"elementClasses\":null,\"elementTarget\":null}"
"{\"targetUrl\":\"https://snowplowanalytics.com/request-demo/\", \"elementId\":\"button-request-demo-header-page\", \"elementClasses\":[\"btn\",\"btn-primary\",\"call-to-action\"]}"
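Note that this sample uses lowercase null: the uppercase NULL in the question's first row is not valid JSON, which is why from_json returns a null struct for that whole row in the default PERMISSIVE mode. The standard json module rejects it the same way:

```python
import json

valid = '{"targetUrl": "https://snowplowanalytics.com/products/snowplow-insights", "elementId": null}'
invalid = '{"targetUrl": "https://snowplowanalytics.com/products/snowplow-insights", "elementId": NULL}'

parsed = json.loads(valid)          # lowercase null parses to None
try:
    json.loads(invalid)
    bad_json_rejected = False
except json.JSONDecodeError:
    bad_json_rejected = True        # uppercase NULL is rejected as malformed JSON
```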

After that, with this code,

import pyspark.sql.functions as f
from pyspark.sql import types as st

simpleSchema = st.StructType([
    st.StructField("targetUrl", st.StringType(), True),
    st.StructField("elementId", st.StringType(), True),
    st.StructField("elementClasses", st.ArrayType(st.StringType()), True),
    st.StructField("elementTarget", st.StringType(), True)
])

# df is the DataFrame read from the stringified CSV above
df.withColumn('col1', f.from_json('col1', simpleSchema)).show(10, False)

+-------------------------------------------------------------------------------------------------------------------+
|col1                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------+
|[https://snowplowanalytics.com/products/snowplow-insights,,,]                                                      |
|[https://snowplowanalytics.com/request-demo/, button-request-demo-header-page, [btn, btn-primary, call-to-action],]|
+-------------------------------------------------------------------------------------------------------------------+

it works fine.

Lamanus

Setting the primitivesAsString parameter to True worked for me.

primitivesAsString – infers all primitive values as a string type. If None is set, it uses the default value, false.

Elias