I'm actually facing an issue I hope I can explain.
I'm trying to parse a CSV file with PySpark. This csv file has some JSON columns. Those Json columns have the same Schema, but are not filled the same way.
For instance i have :
{"targetUrl":"https://snowplowanalytics.com/products/snowplow-insights", "elementId":NULL, "elementClasses":NULL,"elementTarget":NULL}
or
{"targetUrl":"https://snowplowanalytics.com/request-demo/", "elementId":"button-request-demo-header-page", "elementClasses":["btn","btn-primary","call-to-action"]}
Atm, when I do :
simpleSchema = st.StructType([
st.StructField("targetUrl",st.StringType(),True),
st.StructField("elementId",st.StringType(),True),
st.StructField("elementClasses",st.StringType(),True)
])
df = spark.read.format("csv").option("header","true").option("quoteAll","true").option("escape", "\"").load("./Sources/explore_snowplow_data_raw.csv")
df.select(fn.from_json(fn.col("link_click_event"),simpleSchema).alias("linkJson")).select("linkJson.*").show(50)
(link_click_event is my JSON column name)
Only my second JSON field is fully returned because no values are null.
My problem is that the first row is returned as
+--------------------+--------------------+--------------------+
| targetUrl| elementId| elementClasses|
+--------------------+--------------------+--------------------+
| null| null| null|
How can I reach a result as followed for my first line ?
+--------------------+--------------------+--------------------+
| targetUrl| elementId| elementClasses|
+--------------------+--------------------+--------------------+
|"https://snowplo"...| null| null|
Many thanks