
I have a dataframe which has 2 columns: "ID" and "input_array" (values are JSON arrays).

ID   input_array
1    [ {"A":300, "B":400}, { "A":500,"B": 600} ]
2    [ {"A": 800, "B": 900} ]

Output that I need:

ID A      B
1  300    400
1  500    600
2  800    900

I tried the from_json and explode functions, but I get a data type mismatch error for the array columns.

[Image: real data sample]
In the image, the 1st dataframe is the input dataframe which I need to read and convert to the 2nd dataframe. 3 input rows need to be converted to 5 output rows.

— ZygD

3 Answers


You can do this in four steps, sketched below:

  • Remove the square brackets using regexp_replace or substring.
  • Transform the string holding multiple JSON objects into an array using split.
  • Unwrap the array, making a new row for each element, using explode.
  • Parse the remaining JSON column using from_json.
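
A minimal sketch of these four steps, assuming the sample data from the question (the '|' separator is an assumption and must not occur inside the values):

from pyspark.sql import functions as F

# Sample data from the question; input_array is a plain JSON string
df = spark.createDataFrame(
    [(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
     (2, '[ {"A": 800, "B": 900} ]')],
    ['ID', 'input_array'])

df = (df
      # 1. remove the outer square brackets
      .withColumn('s', F.regexp_replace('input_array', r'^\s*\[\s*|\s*\]\s*$', ''))
      # 2. put a separator between the objects, then split into an array
      #    (assumes '|' never appears inside the data)
      .withColumn('s', F.split(F.regexp_replace('s', r'\}\s*,\s*\{', '}|{'), r'\|'))
      # 3. one row per JSON object
      .withColumn('s', F.explode('s'))
      # 4. parse each JSON object and expand its fields into columns
      .withColumn('s', F.from_json('s', 'struct<A:long,B:long>'))
      .select('ID', 's.*'))

df.show()
# +---+---+---+
# | ID|  A|  B|
# +---+---+---+
# |  1|300|400|
# |  1|500|600|
# |  2|800|900|
# +---+---+---+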

Doc: pyspark.sql.functions

— Artem Astashov

If Input_array is a string, then you need to parse the string as JSON, explode it into rows, and expand the keys into columns. You can parse the array using the ArrayType data structure:

from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
from pyspark.sql import functions as F

data = [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]'),
        ('2', '[{"A": 800, "B": 900}]')]

# The JSON string is an array of structs with integer fields A and B
my_schema = ArrayType(
    StructType([
        StructField('A', IntegerType()),
        StructField('B', IntegerType())
    ])
)

# Parse the string, explode the array into rows, expand the struct into columns
df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select('id', F.explode('Input_array').alias('Input_array'))\
    .select('id', F.col('Input_array.*'))

df.show(truncate=False)

# +---+---+---+
# |id |A  |B  |
# +---+---+---+
# |1  |300|400|
# |1  |500|600|
# |2  |800|900|
# +---+---+---+
— Luiz Viola

I have 2 interpretations of what data type your input column ("input_array") may have.

  • If it's a string...

    df = spark.createDataFrame(
        [(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
         (2, '[ {"A": 800, "B": 900} ]')],
        ['ID', 'input_array'])
    df.printSchema()
    # root
    #  |-- ID: long (nullable = true)
    #  |-- input_array: string (nullable = true)
    

    ...you can use from_json to extract a Spark structure from the JSON string and then inline to explode the resulting array of structs into rows and columns.

    df = df.selectExpr(
        "ID",
        "inline(from_json(input_array, 'array<struct<A:long,B:long>>'))"
    )
    df.show()
    # +---+---+---+
    # | ID|  A|  B|
    # +---+---+---+
    # |  1|300|400|
    # |  1|500|600|
    # |  2|800|900|
    # +---+---+---+
    
  • If it's an array of strings...

    df = spark.createDataFrame(
        [(1, [ '{"A":300, "B":400}', '{ "A":500,"B": 600}' ]),
         (2, [ '{"A": 800, "B": 900}' ])],
        ['ID', 'input_array'])
    df.printSchema()
    # root
    #  |-- ID: long (nullable = true)
    #  |-- input_array: array (nullable = true)
    #  |    |-- element: string (containsNull = true)
    

    ...you can first use explode to move each array element into its own row, resulting in a column of string type, then use from_json to create Spark structs from the strings, and finally expand the structs into columns with *.

    from pyspark.sql import functions as F
    
    df = df.withColumn('input_array', F.explode('input_array'))
    df = df.withColumn('input_array', F.from_json('input_array', 'struct<A:long,B:long>'))
    df = df.select('ID', 'input_array.*')
    
    df.show()
    # +---+---+---+
    # | ID|  A|  B|
    # +---+---+---+
    # |  1|300|400|
    # |  1|500|600|
    # |  2|800|900|
    # +---+---+---+
    
— ZygD