I am following the snowflake schema. I have a fact table with 9 dimension tables for a Redshift warehouse. My ETL workflow runs in AWS Glue. I have simplified the problem to just two tables (one dimension table and one fact table).
The issue I am having is that each time I run my ETL I get different mapping results, and I think it has to do with the way I am joining the physical_attribute table to the animal table.
I extract data from a source table called cat in the Glue Data Catalog:
from itertools import chain

from pyspark.sql import functions as F

cat_df = glueContext.create_dynamic_frame.from_catalog(database="animal_parquet", table_name="cat").toDF()
Next, I set up my mapping function. I have done it two ways. Here is the data being mapped, held in a dict:
animal_color_map = {
    "r": "red",
    "w": "white",
    "bl": "black",
    "br": "brown",
}
F.create_map
This is the approach I plan to use if I can sort out the other issues.
animal_color_mapper = F.create_map(
    [F.lit(x) for x in chain(*animal_color_map.items())])
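For reference, `chain(*animal_color_map.items())` flattens the dict into the alternating key/value sequence that `F.create_map` expects. That part can be sanity-checked in plain Python without Spark:

```python
from itertools import chain

animal_color_map = {
    "r": "red",
    "w": "white",
    "bl": "black",
    "br": "brown",
}

# chain(*items()) yields k1, v1, k2, v2, ... in insertion order,
# which is the alternating layout F.create_map expects.
flattened = list(chain(*animal_color_map.items()))
```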
F.udf
@F.udf
def animal_color_mapper2(x):
    return animal_color_map.get(x, "NOT SPECIFIED")
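As far as I can tell from the pyspark.sql.functions.udf docs, Python UDFs are treated as deterministic by default, and `asNondeterministic()` only exists to mark the opposite. The lookup logic itself is plain deterministic Python, which can be checked without Spark:

```python
animal_color_map = {
    "r": "red",
    "w": "white",
    "bl": "black",
    "br": "brown",
}

def animal_color_mapper2(x):
    # dict.get with a default is deterministic for any input,
    # including None (unmapped codes fall back to "NOT SPECIFIED").
    return animal_color_map.get(x, "NOT SPECIFIED")
```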
Okay, now to create the data frames, join the data from the dimension table back to the fact table, and evaluate.
**I believe the issue comes from a PySpark UDF-like function being non-deterministic, but I can't find anything in the docs about turning them into deterministic functions.**
physical_attribute_df = (
    cat_df.select(
        F.coalesce(
            animal_color_mapper[cat_df.fur_color],
            F.lit("NOT SPECIFIED")
        ).cast('string').alias('color'),  # color
        F.coalesce(
            animal_color_mapper2(cat_df.fur_color),
            F.lit("NOT SPECIFIED")
        ).cast('string').alias('color2'),  # color2
        cat_df.zoo_id.alias('_zoo_id'),
    ).groupBy(
        'color',
        'color2',
    ).agg(
        F.collect_list('_zoo_id').alias('_zoo_ids')
    ).coalesce(1).withColumn(
        'id', F.monotonically_increasing_id()  # id
    ).withColumn(
        'created_date', F.current_timestamp()  # created_date
    ).withColumn(
        'last_updated', F.current_timestamp()  # last_updated
    )
)
This table at this point looks like this:
+------------+-------------+--------------------+---+--------------------+--------------------+
|       color|       color2|            _zoo_ids| id|        created_date|        last_updated|
+------------+-------------+--------------------+---+--------------------+--------------------+
| black| black|[10447643, 104525...| 0|2021-11-11 18:38:...|2021-11-11 18:38:...|
| brown| brown|[10450650, 104551...| 1|2021-11-11 18:38:...|2021-11-11 18:38:...|
| white| white|[10445953, 104470...| 2|2021-11-11 18:38:...|2021-11-11 18:38:...|
| red| red|[10453690, 104547...| 3|2021-11-11 18:38:...|2021-11-11 18:38:...|
+------------+-------------+--------------------+---+--------------------+--------------------+
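One suspicion worth noting here: the id column comes from `monotonically_increasing_id()`, whose values depend on partition layout, so they can differ between runs even when the grouped rows are identical. Deriving the surrogate key from a sort of the natural key is reproducible. A minimal pure-Python sketch of the idea, where the hypothetical `colors` list stands in for the distinct grouped keys:

```python
# Hypothetical stand-in for the distinct grouped keys from the step above.
colors = ["black", "brown", "white", "red"]

# Sorting the natural key and enumerating yields the same id on every run,
# unlike monotonically_increasing_id(), which depends on partitioning.
stable_ids = {color: i for i, color in enumerate(sorted(colors))}
```

In Spark the equivalent would be something like `F.row_number().over(Window.orderBy('color')) - 1`, which is deterministic as long as the ordering key is unique.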
Note the groupBy/agg step: the collected list is what I use to join the data back to the fact table after I explode it.
_physical_attributes_df = physical_attribute_df.select(
    F.explode(physical_attribute_df._zoo_ids).alias('zoo_id'),
    physical_attribute_df.id.alias('physical_attribute_id'),
)
This table at this point looks like this:
+--------+---------------------+
| zoo_id|physical_attribute_id|
+--------+---------------------+
|10447643| 0|
|10452584| 0|
|10453651| 0|
|10448127| 0|
|10444833| 0|
+--------+---------------------+
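The explode step is just an unnest of each id list into one row per element. A pure-Python equivalent of what it does to the grouped rows (the sample values are hypothetical):

```python
# Hypothetical grouped rows: (color, physical_attribute_id, _zoo_ids).
rows = [
    ("black", 0, [10447643, 10452584]),
    ("brown", 1, [10450650]),
]

# F.explode emits one (zoo_id, physical_attribute_id) row per list element.
pairs = [(zoo_id, attr_id)
         for _color, attr_id, zoo_ids in rows
         for zoo_id in zoo_ids]
```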
Now that I have this lookup table, I drop the _zoo_ids column.
physical_attribute_df = physical_attribute_df.drop('_zoo_ids')
This table at this point looks like this:
+------------+-------------+---+--------------------+--------------------+
| color| color2| id| created_date| last_updated|
+------------+-------------+---+--------------------+--------------------+
| black| black| 0|2021-11-11 18:38:...|2021-11-11 18:38:...|
| brown| brown| 1|2021-11-11 18:38:...|2021-11-11 18:38:...|
| white| white| 2|2021-11-11 18:38:...|2021-11-11 18:38:...|
| red| red| 3|2021-11-11 18:38:...|2021-11-11 18:38:...|
+------------+-------------+---+--------------------+--------------------+
Here is my final step: joining the lookup table back to the fact table.
animal_df = (
    cat_df.join(
        _physical_attributes_df,
        _physical_attributes_df.zoo_id == cat_df.zoo_id,
        'left'
    ).select(
        F.current_timestamp().alias('created_date'),  # created_date
        F.current_timestamp().alias('last_updated'),  # last_updated
        cat_df.zoo_id.alias('zoo_id'),  # zoo_id
        _physical_attributes_df.physical_attribute_id.alias('physical_attribute_id'),  # physical_attribute_id
    ).coalesce(1).withColumn(
        'id', F.monotonically_increasing_id()  # id
    )
)
When I validate, the mapped data comes out different on every ETL run.