This is my schema:

my_schema = StructType([
    StructField('uid', StringType(), True),
    StructField('test_id', StringType(), True),
    StructField("struct_ids", ArrayType(
        StructType([
                StructField("st", IntegerType(), True),
                StructField("mt", IntegerType(), True),
            ])
        ) )
 ])

This is my data:

my_data = {'table_test': {'uid': 'test',
                          'test_id': 'test',
                          'struct_ids': [{'st': 1234, 'mt': 1111}, {'st': 6789, 'mt': 2222}]}}

This is how I create a DataFrame, and it works:

df = spark.createDataFrame(data=[my_data['table_test']], schema=my_schema)

How can I create multiple rows? For example, how do I add the row below, either when the DataFrame is created or later?

{'uid': 'test2',
 'test_id': 'test2',
 'struct_ids': [{'st': 3333, 'mt': 114411}, {'st': 333, 'mt': 444}]}

Creating an array did not work.

Blue Clouds
  • What is wrong with the current code? "Add this row to the table during creation of table" is not clear to me. Please show the expected DataFrame in table format. – Emma Aug 23 '23 at 19:48

2 Answers

Explode the array of structs with inline:

from pyspark.sql import functions as F

df.select('*', F.inline('struct_ids')).drop('struct_ids')

+----+-------+----+----+
| uid|test_id|  st|  mt|
+----+-------+----+----+
|test|   test|1234|1111|
|test|   test|6789|2222|
+----+-------+----+----+
Shubham Sharma
  • Perhaps my question was not clear; I have edited it. It is not about the struct array but about adding a whole row to the table. – Blue Clouds Aug 23 '23 at 18:21

If your required output is this:

[image]

use this when creating the table:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType
spark = SparkSession.builder.appName("SplitDataExample").getOrCreate()
my_schema = StructType([
    StructField('uid', StringType(), True),
    StructField('test_id', StringType(), True),
    StructField("struct_ids", ArrayType(
        StructType([
            StructField("st", IntegerType(), True),
            StructField("mt", IntegerType(), True),
        ])
    ))
])
my_data = {
    'table_test': {
        'uid': 'test',
        'test_id': 'test',
        'struct_ids': [{'st': 1234, 'mt': 1111}, {'st': 6789, 'mt': 2222}]
    }
}
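# Pull the shared columns and the array out of the single input record.
uid = my_data['table_test']['uid']
test_id = my_data['table_test']['test_id']
struct_ids = my_data['table_test']['struct_ids']
# Build one row per struct; each row's struct_ids is a one-element array.
rows = [(uid, test_id, [struct]) for struct in struct_ids]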
df = spark.createDataFrame(rows, schema=my_schema)
df.show(truncate=False)
spark.stop()
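
As the asker's comment clarifies, the goal is to add a whole second row rather than to split the array. The following is a minimal sketch of two common ways to do that, not taken from either answer: pass a list with one dict per row to createDataFrame, or build a second DataFrame later and union it with the first. The helper names create_with_all_rows and append_rows and the DDL schema string are assumptions for illustration, and both helpers expect an existing SparkSession to be passed in.

```python
# Schema written as a DDL string (an assumption made for brevity); it is
# intended to describe the same columns as my_schema in the question.
SCHEMA_DDL = "uid string, test_id string, struct_ids array<struct<st:int,mt:int>>"

# The two rows from the question, one dict per row.
first_row = {
    "uid": "test",
    "test_id": "test",
    "struct_ids": [{"st": 1234, "mt": 1111}, {"st": 6789, "mt": 2222}],
}
second_row = {
    "uid": "test2",
    "test_id": "test2",
    "struct_ids": [{"st": 3333, "mt": 114411}, {"st": 333, "mt": 444}],
}
all_rows = [first_row, second_row]


def create_with_all_rows(spark, rows=all_rows, schema=SCHEMA_DDL):
    """Option 1: create the DataFrame from a list holding one dict per row."""
    return spark.createDataFrame(data=rows, schema=schema)


def append_rows(spark, df, rows):
    """Option 2: add rows later by building a second DataFrame that reuses
    df's schema and unioning the two by column name."""
    extra = spark.createDataFrame(data=rows, schema=df.schema)
    return df.unionByName(extra)
```

With the spark session and df from the question, create_with_all_rows(spark) builds both rows at once, while append_rows(spark, df, [second_row]) adds the second row to the existing single-row DataFrame. In both cases the data argument is a list of row dicts, not a single dict.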