
I am using the code below to read from a REST API in PySpark, write the response to a JSON document, and save the file to Azure Data Lake Gen2. The code works fine when the response has no blank data, but when I try to get all the data back I run into the following error.

Error Message: ValueError: Some of types cannot be determined after inferring.

Code:

import requests
from pyspark.sql import Row

response = requests.get('https://apiurl.com/demo/api/v3/data',
                        auth=('user', 'password'))
data = response.json()
df = spark.createDataFrame([Row(**i) for i in data])
df.show()
df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/demo/data")

Response:

[
    {
        "ProductID": "156528",
        "ProductType": "Home Improvement",
        "Description": "",
        "SaleDate": "0001-01-01T00:00:00",
        "UpdateDate": "2015-02-01T16:43:18.247"
    },
    {
        "ProductID": "126789",
        "ProductType": "Pharmacy",
        "Description": "",
        "SaleDate": "0001-01-01T00:00:00",
        "UpdateDate": "2015-02-01T16:43:18.247"
    }
]

I tried to fix the schema like below.

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("ProductID", StringType(), True),
    StructField("ProductType", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("SaleDate", StringType(), True),
    StructField("UpdateDate", StringType(), True)
])
df = spark.createDataFrame([[None, None, None, None, None]], schema=schema)
df.show()

I am not sure how to create the dataframe and write the data to a JSON document.

paone

1 Answer

You can pass the `data` and `schema` variables to `spark.createDataFrame()` and Spark will create the dataframe.

Example:

from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.types import *


data=[
    {
        "ProductID": "156528",
        "ProductType": "Home Improvement",
        "Description": "",
        "SaleDate": "0001-01-01T00:00:00",
        "UpdateDate": "2015-02-01T16:43:18.247"
    },
    {
        "ProductID": "126789",
        "ProductType": "Pharmacy",
        "Description": "",
        "SaleDate": "0001-01-01T00:00:00",
        "UpdateDate": "2015-02-01T16:43:18.247"
    }
]

schema = StructType([
    StructField("ProductID", StringType(), True),
    StructField("ProductType", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("SaleDate", StringType(), True),
    StructField("UpdateDate", StringType(), True)
])


df = spark.createDataFrame(data, schema=schema)

df.show()
#+---------+----------------+-----------+-------------------+--------------------+
#|ProductID|     ProductType|Description|           SaleDate|          UpdateDate|
#+---------+----------------+-----------+-------------------+--------------------+
#|   156528|Home Improvement|           |0001-01-01T00:00:00|2015-02-01T16:43:...|
#|   126789|        Pharmacy|           |0001-01-01T00:00:00|2015-02-01T16:43:...|
#+---------+----------------+-----------+-------------------+--------------------+
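The "Some of types cannot be determined" error typically appears when some records are missing fields entirely (or only contain nulls), so Spark cannot infer a type for those columns. If your API can return such records, one option is to normalize them in plain Python before handing them to `spark.createDataFrame`. A minimal sketch, where `normalize_records` is a hypothetical helper (not part of PySpark):

```python
# Hypothetical helper: give every record the same set of keys so that
# the explicit schema lines up even when the API drops fields.
# Records missing a field get None; existing values are left as-is.

FIELDS = ["ProductID", "ProductType", "Description", "SaleDate", "UpdateDate"]

def normalize_records(records, fields=FIELDS):
    """Return a list of dicts that all contain exactly `fields` keys."""
    return [{f: rec.get(f) for f in fields} for rec in records]

data = [
    {"ProductID": "156528", "ProductType": "Home Improvement"},  # missing fields
    {"ProductID": "126789", "ProductType": "Pharmacy", "Description": None},
]

rows = normalize_records(data)
print(rows[0]["Description"])  # None, rather than a missing key
```

After normalizing, `spark.createDataFrame(normalize_records(data), schema=schema)` followed by `df.write.mode("overwrite").json(path)` should work even when some fields are missing or null in the response.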
notNull