
I am using an Azure Synapse notebook in PySpark to write the contents of an API call (a JSON payload containing an array) into a DataFrame, and later write it to a database.

Code:

from pyspark.sql.functions import explode
from pyspark.sql import SparkSession

df.select(explode("Data").alias("d")).select("d.Date", "d.Unit", "d.Code")

JSON:

{"Data":[{"Date":"2021-10-01T00:00:00.000-08:00","Unit":"HR","Code":"0012"},{"Date":"1900-10-02T00:00:00.000-05:00","Unit":"HR","Code":"0021"}]}
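
For illustration only, the flattening that `explode` performs on this payload can be sketched in plain Python with the standard `json` module (this is not part of the Spark pipeline, just a picture of the target shape):

```python
import json

# Sample payload copied from the question
payload = '{"Data":[{"Date":"2021-10-01T00:00:00.000-08:00","Unit":"HR","Code":"0012"},{"Date":"1900-10-02T00:00:00.000-05:00","Unit":"HR","Code":"0021"}]}'

# Each element of the "Data" array becomes one row of scalar values,
# which is the shape a SQL table (and the JDBC writer) can accept
rows = [(d["Date"], d["Unit"], d["Code"]) for d in json.loads(payload)["Data"]]
```
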

Schema:

root
 |-- Data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Date: string (nullable = true)
 |    |    |-- Unit: string (nullable = true)
 |    |    |-- Code: string (nullable = true)

Dataframe:

DataFrame[Code: array<struct<Date:string,Unit:string,Code:string>>]
{"Data":[{"Date":"2021-10-01T00:00:00.000-08:00","Unit":"HR","Code":"0012"},{"Date":"1900-10-02T00:00:00.000-05:00","Unit":"HR","Code":"0021"}]}

Error:

IllegalArgumentException: Can't get JDBC type for array<struct<Date:string,Unit:string,Code:string>>
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 737, in save
    self._jwrite.save()
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 85, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)

Write from Dataframe to database:

server_name = "jdbc:sqlserver://ServerName"
database_name = "database_name"
url = server_name + ";" + "databaseName=" + database_name + ";"
table_name = "table_name"
username = "username"
password = "password"

df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .save()

Could you please guide me on how to explode the contents of the API JSON payload and write the data to a database table?

paone
  • Does `explode("Data")` give you a table like you expect? Where is the database write call? – OneCricketeer Oct 06 '21 at 18:56
  • Hi @OneCricketeer, Updated my OP with the database write call. – paone Oct 06 '21 at 20:14
  • And what is `df`? Apparently that is not your exploded JSON – OneCricketeer Oct 06 '21 at 20:26
  • `df` is this not-yet-exploded JSON: `{"Data":[{"Date":"2021-10-01T00:00:00.000-08:00","Unit":"HR","Code":"0012"},{"Date":"1900-10-02T00:00:00.000-05:00","Unit":"HR","Code":"0021"}]}`. I am trying to load this dataframe `df` into a SQL table. – paone Oct 06 '21 at 21:02
  • Yes, I know. Have you tried to reassign `df` to the result of `.select(explode("Data").alias("d")).select("d.Date","d.Unit","d.Code")` before you `.write` it? Basically, you cannot store an array in a SQL Server column, as the error says – OneCricketeer Oct 06 '21 at 21:45

0 Answers