1

Sample: I have a partitioned table with DDL below in Glue catalog:

CREATE EXTERNAL TABLE `test`(
  `id` int, 
  `data` struct<a:string,b:string>)
PARTITIONED BY ( 
  `partition_0` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 

The underlying data in S3 is json files with varying schema meaning that some elements may not exist in some files but exist in other.

In the this sample partition_0='01' contains json file with all elements:

{"id": 1,"data": {"a": "value-a", "b": "value-b"}}

The file in partition_0='02' does not contain element data.b:

{"id": 1,"data": {"a": "value-a"}}

Issue: When I create DynamicFrame in Glue (I use Python), its schema depends on the data that I query. If I include the data from partition_0='01' then all elements are present in the schema.

id_partition_predicate="partition_0 = '01'"
print("partition with 'b'")
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate).printSchema()

partition with 'b'
root
|-- id: int
|-- data: struct
|    |-- a: string
|    |-- b: string
|-- partition_0: string
print("both partitions")
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test").printSchema()

both partitions
root
|-- id: int
|-- data: struct
|    |-- a: string
|    |-- b: string
|-- partition_0: string

If I query only data from partition_0='02' then element data.b does not exist in the DynamicFrame schema even though it exists in the table definition.

print("partition without 'b'")
id_partition_predicate="partition_0 = '02'"
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate).printSchema()


partition without 'b'
root
|-- id: int
|-- data: struct
|    |-- a: string
|-- partition_0: string

Question: How create DynamicFrame or DataFrame that always contains all elements from the Glue table's schema?

Thanks in advance!

Leonid
  • 11
  • 2

1 Answers1

0

Came up with this solution:

id_partition_predicate="partition_0 = '02'"
dyf = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate)
dyf.printSchema()

df=dyf.toDF()
try:  
    df = df.withColumn("b", col("data").getItem("b"))
except:
    df = df.withColumn("b", lit(None).cast(StringType()))
df.show()

Output:

root
|-- id: int
|-- data: struct
|    |-- a: string
|-- partition_0: string

+---+---------+-----------+----+
| id|     data|partition_0|   b|
+---+---------+-----------+----+
|  1|[value-a]|         02|null|
+---+---------+-----------+----+
Leonid
  • 11
  • 2