
I'm having some issues reading items from Cosmos DB in Databricks: it seems to read the JSON as a string value, and I'm having trouble getting the data out of it into columns.

I have a column called ProductRanges with the following values in a row:

[   {
        "name": "Red",
        "min": 0,
        "max": 99,
        "value": "Order More"
    },
    {
        "name": "Amber",
        "min": 100,
        "max": 499,
        "value": "Stock OK"
    },
    {
        "name": "Green",
        "min": 500,
        "max": 1000000,
        "value": "Overstocked"
    }
]

In Cosmos DB the JSON document is valid, but when importing the data the datatype in the DataFrame is a string, not a JSON object/struct as I would expect.

I would like to be able to count the number of times "name" comes up and iterate through them to get the min, max and value items, since the number of ranges we can have can be more than 3. I've been through a few posts on Stack Overflow and elsewhere but am stuck on the formatting. I've tried to use explode and to read the schema based on the column values, but it says 'invalid document'; I think it may be due to PySpark needing {} at the start and the end, but even concatenating that in the SQL query from Cosmos DB still ends up with a datatype of string.

Any pointers would be appreciated

Jon
  • Update: After some investigation, it looks like it's being read as a string from Cosmos DB. If I create a new document, call the field productRangesAlt, and import it, it reads it as an array and I can explode it out. So at some point productRanges may have been a string, but Cosmos DB is not picking up the change. At the moment the Cosmos DB environment is a dev one, so I can wipe that schema and see if a reload will fix it – Jon May 14 '19 at 13:17

1 Answer


You retrieved JSON documents from Azure Cosmos DB and converted them to a PySpark DataFrame, but the nested JSON document or array could not be kept as a JSON object in a DataFrame column as you expected, because there is no JSON type defined in the pyspark.sql.types module.


I found a post, PySpark: Convert JSON String Column to Array of Object (StructType) in Data Frame, which describes a suitable solution for your case.

It shows how to use ArrayType, StructType, StructField and other base PySpark datatypes to convert a JSON string in a column to a combined datatype, which can be processed more easily in PySpark, by defining the column schema and a UDF.

Here is a summary of the sample code; hope it helps.

source = [{"attr_1": 1, "attr_2": "[{\"a\":1,\"b\":1},{\"a\":2,\"b\":2}]"}, {"attr_1": 2, "attr_2": "[{\"a\":3,\"b\":3},{\"a\":4,\"b\":4}]"}]

The JSON is read into a DataFrame through sqlContext. The output is:

+------+--------------------+
|attr_1|              attr_2|
+------+--------------------+
|     1|[{"a":1,"b":1},{"...|
|     2|[{"a":3,"b":3},{"...|
+------+--------------------+

root
  |-- attr_1: long (nullable = true)
  |-- attr_2: string (nullable = true)

Then, convert the attr_2 column by defining the column schema and a UDF.

# Function to convert a JSON array string to a list of tuples
import json

def parse_json(array_str):
    json_obj = json.loads(array_str)
    # Return a list (not a generator) so Spark can serialize the result
    return [(item["a"], item["b"]) for item in json_obj]

# Define the schema
from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField

json_schema = ArrayType(StructType([
    StructField('a', IntegerType(), nullable=False),
    StructField('b', IntegerType(), nullable=False)
]))

# Define the UDF
from pyspark.sql.functions import udf

udf_parse_json = udf(parse_json, json_schema)

# Generate a new data frame with the expected schema
df_new = df.select(df.attr_1, udf_parse_json(df.attr_2).alias("attr_2"))
df_new.show()
df_new.printSchema()

The output is the following:

+------+--------------+
|attr_1|        attr_2|
+------+--------------+
|     1|[[1,1], [2,2]]|
|     2|[[3,3], [4,4]]|
+------+--------------+

root
  |-- attr_1: long (nullable = true)
  |-- attr_2: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- a: integer (nullable = false)
  |    |    |-- b: integer (nullable = false)
Peter Pan
  • I get the error, TypeError: the JSON object must be str, bytes or bytearray, not list. Would the code change since May '19? – Christa Jan 27 '21 at 16:47