
How to parse a JSON string of nested lists into a Spark DataFrame in PySpark?

Input data frame:

+-------------+-----------------------------------------------+
|url          |json                                           |
+-------------+-----------------------------------------------+
|https://url.a|[[1572393600000, 1.000],[1572480000000, 1.007]]|
|https://url.b|[[1572825600000, 1.002],[1572912000000, 1.000]]|
+-------------+-----------------------------------------------+

root
 |-- url: string (nullable = true)
 |-- json: string (nullable = true)

Expected output:

+---------------------------------------+
|col_1 | col_2               | col_3    |
+---------------------------------------+
| a    | 1572393600000       |  1.000   | 
| a    | 1572480000000       |  1.007   |
| b    | 1572825600000       |  1.002   |
| b    | 1572912000000       |  1.000   |
+---------------------------------------+

Example code:

import pyspark
import pyspark.sql.functions as F

spark = (pyspark.sql.SparkSession.builder.appName("Downloader_standalone")
    .master('local[*]')
    .getOrCreate())

sc = spark.sparkContext
from pyspark.sql import Row

rdd_list  = [('https://url.a','[[1572393600000, 1.000],[1572480000000, 1.007]]'),
             ('https://url.b','[[1572825600000, 1.002],[1572912000000, 1.000]]')]

jsons = sc.parallelize(rdd_list) 

df = spark.createDataFrame(jsons, "url string, json string")
df.show(truncate=False)
df.printSchema()


(df.withColumn('json', F.from_json(F.col('json'),"array<string,string>"))
.select(F.explode('json').alias('col_1', 'col_2', 'col_3')).show())

There are a few similar examples out there, but I cannot figure out how to do it.

Dan

3 Answers


With some replacements in the strings and by splitting, you can get the desired result:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "col_1",
    F.regexp_replace("url", "https://url.", "")
).withColumn(
    "col_2_3",
    F.explode(
        F.expr("""transform(
            split(trim(both '][' from json), '\\\],\\\['), 
            x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3)
        )""")
    )
).selectExpr("col_1", "col_2_3.*")

df1.show(truncate=False)

#+-----+-------------+------+
#|col_1|col_2        |col_3 |
#+-----+-------------+------+
#|a    |1572393600000| 1.000|
#|a    |1572480000000| 1.007|
#|b    |1572825600000| 1.002|
#|b    |1572912000000| 1.000|
#+-----+-------------+------+

Explanation:

  1. trim(both '][' from json) : removes the leading and trailing characters [ and ], giving something like: 1572393600000, 1.000],[1572480000000, 1.007

  2. Now you can split by ],[ (the backslashes escape the brackets, which are special characters in the regex)

  3. transform takes the array from the split and, for each element, splits it by comma and creates a struct with fields col_2 and col_3

  4. explode the array of structs you get from the transform, then star-expand the struct column
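The trim/split/transform steps above can be mirrored in plain Python, which may make the intermediate values easier to see (a sketch; plain `re` and `str` methods stand in for the Spark SQL functions):

```python
import re

json_str = "[[1572393600000, 1.000],[1572480000000, 1.007]]"

# Step 1: strip leading/trailing '[' and ']' (like trim(both '][' from json))
trimmed = json_str.strip("][")

# Step 2: split on "],[" -- brackets are escaped in the regex
parts = re.split(r"\],\[", trimmed)

# Step 3: split each element on the comma into (col_2, col_3) pairs
# (this is what the SQL transform(..., x -> struct(...)) does per element)
rows = [tuple(s.strip() for s in p.split(",")) for p in parts]

print(rows)
# [('1572393600000', '1.000'), ('1572480000000', '1.007')]
```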

blackbishop
  • thanks for your answer, for some reason I get an exception, any idea how to solve it? `AnalysisException: Can only star expand struct data types. Attribute: ArrayBuffer(json);` – Dan Feb 08 '21 at 20:22
  • That works well. Could you also explain what `F.expr("""transform` does, what `'\\\` is, etc.? – Dan Feb 08 '21 at 20:37
  • Last thing, if you could help: if there is a url like "https://url&id=ID1234]2]0]" and I would like to extract "ID1234" (the id always has 6 characters), any idea how to do it? – Dan Feb 08 '21 at 20:59
  • @Dan you can use the regexp_extract function: `F.regexp_extract(F.col("url"), ".+(ID\\d{4}).+", 1)` – blackbishop Feb 08 '21 at 21:04
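The `regexp_extract` suggestion in the comment above can be checked in plain Python with the `re` module (a sketch using the example URL from the comment):

```python
import re

url = "https://url&id=ID1234]2]0]"

# Same idea as F.regexp_extract(F.col("url"), ".+(ID\\d{4}).+", 1):
# "ID" followed by 4 digits (6 characters total)
m = re.search(r"ID\d{4}", url)
print(m.group(0) if m else None)
# ID1234
```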
(df.select(df.url, F.explode(F.from_json(df.json, "array<string>")))
 .select("url", F.from_json(F.col("col"), "array<string>").alias("col"))
 .select("url", F.col("col").getItem(0), F.col("col").getItem(1))
 .show(truncate=False))

+-------------+-------------+------+
|url          |col[0]       |col[1]|
+-------------+-------------+------+
|https://url.a|1572393600000|1.0   |
|https://url.a|1572480000000|1.007 |
|https://url.b|1572825600000|1.002 |
|https://url.b|1572912000000|1.0   |
+-------------+-------------+------+
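The reason the double `from_json` works is that the column is a valid JSON array of arrays. In plain Python a single `json.loads` does the whole job, which may help clarify what the two parsing passes produce (a sketch on the sample string):

```python
import json

s = "[[1572393600000, 1.000],[1572480000000, 1.007]]"

# One pass in plain Python; Spark does it in two from_json("array<string>") calls
rows = json.loads(s)

pairs = [(int(r[0]), float(r[1])) for r in rows]
print(pairs)
# [(1572393600000, 1.0), (1572480000000, 1.007)]
```

Note that parsing the inner values as numbers is why `1.000` shows up as `1.0` in the output above.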
chlebek
  • Thanks, in this case there are always only 2 values in each pair. This works well! – Dan Feb 08 '21 at 20:49
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType

(df.select(col("url"),
           explode(from_json(col("json"), ArrayType(StringType()))))
 .select(col("url"),
         from_json(col("col"), ArrayType(StringType())).alias("col"))
 .select(col("url"),
         col("col").getItem(0).alias("code"),
         col("col").getItem(1).alias("value")))