How to parse a JSON string of nested lists into a Spark DataFrame in PySpark?
Input DataFrame:
+-------------+-----------------------------------------------+
|url |json |
+-------------+-----------------------------------------------+
|https://url.a|[[1572393600000, 1.000],[1572480000000, 1.007]]|
|https://url.b|[[1572825600000, 1.002],[1572912000000, 1.000]]|
+-------------+-----------------------------------------------+
root
|-- url: string (nullable = true)
|-- json: string (nullable = true)
Expected output:
+-----+-------------+-----+
|col_1|        col_2|col_3|
+-----+-------------+-----+
|    a|1572393600000|1.000|
|    a|1572480000000|1.007|
|    b|1572825600000|1.002|
|    b|1572912000000|1.000|
+-----+-------------+-----+
Example code:
import pyspark
import pyspark.sql.functions as F

spark = (pyspark.sql.SparkSession.builder
         .appName("Downloader_standalone")
         .master('local[*]')
         .getOrCreate())
sc = spark.sparkContext

# Two rows of (url, json-string) test data.
rdd_list = [('https://url.a', '[[1572393600000, 1.000],[1572480000000, 1.007]]'),
            ('https://url.b', '[[1572825600000, 1.002],[1572912000000, 1.000]]')]
jsons = sc.parallelize(rdd_list)
df = spark.createDataFrame(jsons, "url string, json string")
df.show(truncate=False)
df.printSchema()
# My attempt; "array<string,string>" is not a valid DDL type, and explode()
# returns a single column, so the three-name alias fails as well:
(df.withColumn('json', F.from_json(F.col('json'), "array<string,string>"))
 .select(F.explode('json').alias('col_1', 'col_2', 'col_3')).show())
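The parsing step alone seems to work once from_json gets a valid DDL type. A minimal check, where 'array<array<double>>' is my assumption that every inner list is a two-element numeric pair:

# Parse the JSON string into an array of [timestamp, value] arrays.
parsed = df.withColumn('json', F.from_json('json', 'array<array<double>>'))
parsed.printSchema()
# root
#  |-- url: string (nullable = true)
#  |-- json: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: double (containsNull = true)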
There are a few similar examples around, but none shows how to turn each [timestamp, value] pair into its own row and columns, with col_1 derived from the url.
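The closest I have gotten is the sketch below, which seems to produce the expected output. Two assumptions of mine that are not given above: col_1 is the part of the url after the last dot, and the decimal(10,3) cast is there only so the values print as 1.000 rather than 1.0.

result = (parsed
    .select(
        F.substring_index('url', '.', -1).alias('col_1'),  # 'https://url.a' -> 'a'
        F.explode('json').alias('pair'))                   # one row per [timestamp, value] pair
    .select(
        'col_1',
        F.col('pair')[0].cast('long').alias('col_2'),      # timestamp back to a whole number
        F.col('pair')[1].cast('decimal(10,3)').alias('col_3')))
result.show()

Is there a cleaner or more idiomatic way to do this?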