I have run out of ideas on how to solve the following issue. A table in the Glue Data Catalog has this schema:

root
|-- _id: string
|-- _field: struct
|    |-- ref: choice
|    |    |-- array
|    |    |    |-- element: struct
|    |    |    |    |-- value: null
|    |    |    |    |-- key: string
|    |    |    |    |-- name: string
|    |    |-- struct
|    |    |    |-- value: null
|    |    |    |-- key: choice
|    |    |    |    |-- int
|    |    |    |    |-- string
|    |    |    |-- name: string

If I try to resolve the ref choice using

resolved = df.resolveChoice(specs=[("_field.ref", "cast:array")])

I lose records.

Any ideas on how I could:

  1. filter the DataFrame on whether _field.ref is an array or struct
  2. convert struct records into an array or vice-versa
CPak

1 Answer

I was able to solve my own problem by using

from awsglue.transforms import ResolveChoice
resolved_df = ResolveChoice.apply(df, choice="make_cols")

This saves the array values in a new ref_array column and the struct values in a new ref_struct column.
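
To make the effect of make_cols concrete, here is a rough plain-Python sketch of the idea. It is not Glue's implementation. The helper name make_cols_sketch and the sample rows are made up for illustration, and the records are simplified to a top-level ref field.

```python
def make_cols_sketch(record, field="ref"):
    """Split a value that may be a list or a dict into two typed columns.

    Hypothetical helper: mimics the idea behind choice="make_cols",
    not the actual Glue code.
    """
    value = record.get(field)
    # Keep every other field unchanged.
    out = {k: v for k, v in record.items() if k != field}
    # Exactly one of the two new columns is populated per record.
    out[field + "_array"] = value if isinstance(value, list) else None
    out[field + "_struct"] = value if isinstance(value, dict) else None
    return out


# Made-up sample rows shaped like the schema in the question.
rows = [
    {"_id": "a", "ref": [{"value": None, "key": "1", "name": "x"}]},
    {"_id": "b", "ref": {"value": None, "key": 2, "name": "y"}},
]
split = [make_cols_sketch(r) for r in rows]
```

Each record ends up with one populated column and one null column, which is why the isNotNull() filters can separate the two shapes.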

This allowed me to split the DataFrame as follows:

from pyspark.sql.functions import col
# Rows where ref was an array
resolved_df1 = resolved_df.filter(col("ref_array").isNotNull()).select(col("ref_array").alias("ref"))
# Rows where ref was a struct
resolved_df2 = resolved_df.filter(col("ref_struct").isNotNull()).select(col("ref_struct").alias("ref"))

After converting either the arrays to structs (using explode()) or the structs to arrays (using array()), recombine the two DataFrames.
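
The struct-to-array direction can be sketched in plain Python to show the logic. This is only an illustration on dicts, not Spark code. The helper to_array and the sample rows are hypothetical. Casting key to a string is my assumption, made so that struct records (where key may be an int) match the array element schema (where key is a string).

```python
def to_array(ref):
    """Normalize ref to a list of structs with string keys (hypothetical helper)."""
    if ref is None:
        return None
    # Wrap a lone struct in a one-element list; leave lists as they are.
    items = [ref] if isinstance(ref, dict) else ref
    # Assumption: cast key to str so both shapes share one element schema.
    return [
        {**item, "key": None if item.get("key") is None else str(item["key"])}
        for item in items
    ]


# Made-up stand-ins for the two split DataFrames.
array_rows = [{"ref": [{"value": None, "key": "1", "name": "x"}]}]
struct_rows = [{"ref": {"value": None, "key": 2, "name": "y"}}]

# Recombine once both sides have the same shape.
combined = [{"ref": to_array(r["ref"])} for r in array_rows + struct_rows]
```

In Spark, the same wrapping step corresponds to applying array() to the struct column before combining the two DataFrames.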

CPak