This answer adds context for the case where your JSON strings are JSON arrays instead of objects (I can't comment since I don't have the rep). If you use Martin Tapp's otherwise solid answer on such strings, it will return null values for your columns.
tl;dr
If your JSON strings are arrays of objects like so:
[{"a":1, "b":1.0}]
then spark.read.json will return a dataframe whose schema describes the elements inside those arrays, not the array itself. from_json isn't happy with this, so to be as specific as it wants, you can wrap the schema inferred by spark.read.json in an ArrayType, and then it will parse properly (instead of returning null values for everything).
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType
array_item_schema = \
spark.read.json(df.rdd.map(lambda row: row['json_string_column'])).schema
json_array_schema = ArrayType(array_item_schema, True)
arrays_df = df.select(F.from_json('json_string_column', json_array_schema).alias('json_arrays'))
objects_df = arrays_df.select(F.explode('json_arrays').alias('objects'))
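Side note: on Spark 2.4+ you can often skip the ArrayType wrapping by inferring the schema with schema_of_json, which keeps the top-level array. A minimal sketch, assuming the same df and json_string_column as above (and note it only samples a single string, unlike spark.read.json which scans them all):
from pyspark.sql import functions as F
# Sketch: infer a DDL schema (e.g. 'ARRAY<STRUCT<a: BIGINT, b: DOUBLE>>') from one
# sample string; from_json accepts DDL-formatted schema strings directly.
sample_json = df.select('json_string_column').first()[0]
ddl_schema = df.select(F.schema_of_json(F.lit(sample_json))).first()[0]
arrays_df = df.select(F.from_json('json_string_column', ddl_schema).alias('json_arrays'))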
Intro
As an addendum to Nolan Conaway's answer, it seems that when your JSON is of the form
[
{
"a": 1.0,
"b": 1
},
{
"a": 0.0,
"b": 2
}
]
where the top-level element is an array (and not an object), pyspark's spark.read.json()
treats the array as a collection of objects to be converted into rows instead of a single row.
See example run in PySpark 3.3.0 shell:
>>> myjson = """[{"a": 1.0,"b": 1},{"a": 2.0,"b": 2}]"""
>>> myotherjson = """[{"a": 3.0,"b": 3}]"""
>>> rawobjectjson = """{"a": 4.0,"b": 4}"""
>>> spark_read_df = spark.read.json(sc.parallelize([myjson,myotherjson,rawobjectjson]))
>>> spark_read_df.show()
+---+---+
| a| b|
+---+---+
|1.0| 1|
|2.0| 2|
|3.0| 3|
|4.0| 4|
+---+---+
>>> spark_read_df.printSchema()
root
|-- a: double (nullable = true)
|-- b: long (nullable = true)
We can see that myjson and myotherjson, which were JSON arrays of JSON objects, got expanded into a row for each object they contained. It also smoothly handled the case where one of the JSON strings, rawobjectjson, was just a raw object. I think the documentation falls a little short here, as I couldn't find any mention of this handling of array objects.
Now let's create a dataframe with a column of JSON strings. We'll drop rawobjectjson because, as we'll see, from_json requires each string to have the same schema (and that includes the top-level array, if present).
>>> from pyspark.sql.types import StructType, StructField, StringType, ArrayType
>>> json_string_data = [
... (myjson,),
... (myotherjson,),
... ]
>>> json_df_schema = StructType([
... StructField('json_strings', StringType(), True),
... ])
>>> raw_json_df = spark.createDataFrame(data=json_string_data, schema=json_df_schema)
>>> raw_json_df.show()
+--------------------+
| json_strings|
+--------------------+
|[{"a": 1.0,"b": 1...|
| [{"a": 3.0,"b": 3}]|
+--------------------+
Now here's where I tried to use the schema inferred by spark.read.json and pass it to from_json to read the JSON column into objects, but it kept returning columns that were fully null. As Nolan Conaway mentioned, this happens when the schema passed to from_json can't be applied to the given strings. The issue is that in these strings the top level is an array, but as spark_read_df.printSchema() shows, the schema inferred by spark.read.json() ignores the array level.
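For illustration, the failed attempt looks roughly like this (a sketch, not part of the original session):
from pyspark.sql import functions as F
# Passing the struct schema from spark.read.json straight to from_json; because it
# has no top-level array, the parsed values come back null, as described above.
null_df = raw_json_df.select(
    F.from_json('json_strings', spark_read_df.schema).alias('parsed')
)
null_df.show()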
The Solution
So the solution I ended up going with was just accounting for the top level array in the schema when doing the read.
from pyspark.sql import functions as F
# This won't work if passed directly to from_json, since it ignores any top-level
# arrays in the JSON strings:
# json_object_schema = spark_read_df.schema
# from_json is a bit more "simple": it applies the schema to the string as-is. Here
# the top-level type is actually an array, so a simple fix is to wrap the schema that
# spark.read.json returned in an ArrayType to match the true JSON string.
json_array_schema = ArrayType(spark_read_df.schema, True)
json_extracted_df = raw_json_df.select(
F.from_json('json_strings', json_array_schema)
.alias('json_arrays')
)
>>> json_extracted_df.show()
+--------------------+
| json_arrays|
+--------------------+
|[{1.0, 1}, {2.0, 2}]|
| [{3.0, 3}]|
+--------------------+
>>> json_extracted_df.printSchema()
root
|-- json_arrays: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: double (nullable = true)
| | |-- b: long (nullable = true)
From there, the objects can be pulled out of the array using pyspark.sql.functions.explode:
>>> exploded_df = json_extracted_df.select(F.explode('json_arrays').alias('objects'))
>>> exploded_df.show()
+--------+
| objects|
+--------+
|{1.0, 1}|
|{2.0, 2}|
|{3.0, 3}|
+--------+
>>> exploded_df.printSchema()
root
|-- objects: struct (nullable = true)
| |-- a: double (nullable = true)
| |-- b: long (nullable = true)
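If you then want the struct fields back as ordinary top-level columns, one more select does it (a small follow-up sketch using the names from above):
# Flatten the struct column; this should give plain a and b columns,
# one row per original array element.
flat_df = exploded_df.select('objects.*')
flat_df.printSchema()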