
I am following a snowflake schema: a fact table with 9 dimension tables in a Redshift warehouse. My ETL workflow runs in AWS Glue. I have simplified the problem to just two tables (one dimension table and one fact table).

The issue I am having is that each time I run my ETL I get different mapping results, and I think it has to do with the way I am joining the physical_attribute table to the animal table.

I extract data from a source table called `cat` in the Glue Data Catalog:

cat_df = glueContext.create_dynamic_frame.from_catalog(database="animal_parquet", table_name="cat").toDF()

Next, I set up my mapping function. I have done it two ways. Here is the data being mapped, as a dict:

animal_color_map = {
    "r": "red",
    "w": "white",
    "bl": "black",
    "br": "brown",
}
  1. `F.create_map`: this is the approach I will use if I can figure out the other issues.
animal_color_mapper = F.create_map(
    [F.lit(x) for x in chain(*animal_color_map.items())])
  2. `F.udf`
@F.udf
def animal_color_mapper2(x):
    return animal_color_map.get(x, "NOT SPECIFIED")
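For reference, `chain(*animal_color_map.items())` flattens the dict into an alternating key/value sequence, which is the argument layout `F.create_map` expects. A quick pure-Python check (no Spark needed):

```python
from itertools import chain

animal_color_map = {
    "r": "red",
    "w": "white",
    "bl": "black",
    "br": "brown",
}

# Flatten the dict items into [key1, value1, key2, value2, ...],
# the alternating layout F.create_map expects.
flat = list(chain(*animal_color_map.items()))
print(flat)  # ['r', 'red', 'w', 'white', 'bl', 'black', 'br', 'brown']
```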

Okay now to perform the creation of the data frames, joining of the data from the dimensions table back to that fact table, and evaluation.

**I believe the issue comes from a PySpark UDF-like function being non-deterministic, but I can't find anything in the docs about making such a function deterministic.**

  
physical_attribute_df = (
    cat_df.select(

        F.coalesce(
            animal_color_mapper[cat_df.fur_color],
            F.lit("NOT SPECIFIED")
        ).cast('string').alias('color'),  # color
        
        F.coalesce(
            animal_color_mapper2(cat_df.fur_color),
            F.lit("NOT SPECIFIED")
        ).cast('string').alias('color2'),  # color2
            
        cat_df.zoo_id.alias('_zoo_id'),

    ).groupBy(
        'color',
        'color2',
    ).agg(
        F.collect_list('_zoo_id').alias('_zoo_ids')
    ).coalesce(1).withColumn(
        'id', F.monotonically_increasing_id() # id
    ).withColumn(
        'created_date', F.current_timestamp() # create_date
    ).withColumn(
        'last_updated', F.current_timestamp() # last_updated
    )

)

This table at this point looks like this:

+------------+-------------+--------------------+---+--------------------+--------------------+
|       color|       color2|            _zoo_ids| id|        created_date|        last_updated|
+------------+-------------+--------------------+---+--------------------+--------------------+
|       black|        black|[10447643, 104525...|  0|2021-11-11 18:38:...|2021-11-11 18:38:...|
|       brown|        brown|[10450650, 104551...|  1|2021-11-11 18:38:...|2021-11-11 18:38:...|
|       white|        white|[10445953, 104470...|  2|2021-11-11 18:38:...|2021-11-11 18:38:...|
|         red|          red|[10453690, 104547...|  3|2021-11-11 18:38:...|2021-11-11 18:38:...|
+------------+-------------+--------------------+---+--------------------+--------------------+

Note the groupBy followed by agg: the collected list is what lets me join back to the fact table after I explode it.

_physical_attribute_ids_df = physical_attribute_df.select(
    F.explode(physical_attribute_df._zoo_ids).alias('zoo_id'),
    physical_attribute_df.id.alias('physical_attribute_id'),
)
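Conceptually, the explode step turns each aggregated dimension row back into one row per `zoo_id`. A minimal pure-Python sketch of the same transformation, using a couple of made-up ids:

```python
# Each dimension row carries a surrogate id and the list of zoo_ids
# that mapped to it (hypothetical sample data).
rows = [
    {"id": 0, "_zoo_ids": [10447643, 10452584]},
    {"id": 1, "_zoo_ids": [10450650]},
]

# Explode: emit one (zoo_id, physical_attribute_id) pair per list element.
lookup = [
    {"zoo_id": z, "physical_attribute_id": row["id"]}
    for row in rows
    for z in row["_zoo_ids"]
]
```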

This table at this point looks like this:

+--------+---------------------+
|  zoo_id|physical_attribute_id|
+--------+---------------------+
|10447643|                    0|
|10452584|                    0|
|10453651|                    0|
|10448127|                    0|
|10444833|                    0|
+--------+---------------------+

Now that I have this lookup table, I drop the `_zoo_ids` column.

physical_attribute_df = physical_attribute_df.drop('_zoo_ids')

This table at this point looks like this:

+------------+-------------+---+--------------------+--------------------+
|       color|       color2| id|        created_date|        last_updated|
+------------+-------------+---+--------------------+--------------------+
|       black|        black|  0|2021-11-11 18:38:...|2021-11-11 18:38:...|
|       brown|        brown|  1|2021-11-11 18:38:...|2021-11-11 18:38:...|
|       white|        white|  2|2021-11-11 18:38:...|2021-11-11 18:38:...|
|         red|          red|  3|2021-11-11 18:38:...|2021-11-11 18:38:...|
+------------+-------------+---+--------------------+--------------------+

Here is my final step: joining the table to the fact table.

animal_df = (

    cat_df.join(
        _physical_attribute_ids_df,
        _physical_attribute_ids_df.zoo_id == cat_df.zoo_id,
        'left'
    ).select(

        F.current_timestamp().alias('created_date'),  # created_date
        F.current_timestamp().alias('last_updated'),  # last_updated

        cat_df.zoo_id.alias('zoo_id'),  # zoo_id

        _physical_attribute_ids_df.physical_attribute_id.alias('physical_attribute_id'),  # physical_attribute_id

    ).coalesce(1).withColumn(
        'id', F.monotonically_increasing_id()  # id
    )
)
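The left join itself is equivalent to looking each fact row's `zoo_id` up in the dimension's surrogate-key table; unmatched rows get a null. A small pure-Python sketch with made-up ids:

```python
# Lookup table produced by the explode step (hypothetical values).
lookup = {10447643: 0, 10450650: 1}

facts = [{"zoo_id": 10447643}, {"zoo_id": 10450650}, {"zoo_id": 99999999}]

# Left join: keep every fact row; rows with no match get None.
joined = [
    {**row, "physical_attribute_id": lookup.get(row["zoo_id"])}
    for row in facts
]
```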

When I validate, the mapped data is different on every ETL run.

  • Can you confirm that it is the `monotonically_increasing_id` generated columns that have different data? – Nithish Nov 11 '21 at 20:01
  • Yes, I think you are correct. `monotonically_increasing_id` generated columns that have different data. Not sure how to check but yes after looking at the function in the docs it is non-deterministic. Is there a better way to create ids? for a look-up table? or a better way of ordering things? – Justin B. Nov 11 '21 at 20:33
  • After `coalesce` ordering by `color` should make then generated ID deterministic. For the `animal_df` then you can order by `physical_attributes_id` for determinism. – Nithish Nov 11 '21 at 20:50
  • Thanks. Does my answer sound okay? The solution I, with your help, created works. Just don't know how to word it best for others later. – Justin B. Nov 11 '21 at 21:08

1 Answer

The issue was that `F.monotonically_increasing_id()` after `coalesce` is non-deterministic (who would have known), since `coalesce` merges partitions without any ordering guarantee.

In such cases, explicit ordering will solve the problem.

What I used to solve the issue was ordering by my table's identity field, in this case `color`, inside `F.row_number().over(Window.orderBy(physical_attribute_df.color))` (note that `Window` is imported from `pyspark.sql.window`).

physical_attribute_df = (
    cat_df.select(

        F.coalesce(
            animal_color_mapper[cat_df.fur_color],
            F.lit("NOT SPECIFIED")
        ).cast('string').alias('color'),  # color

        cat_df.zoo_id.alias('_zoo_id'),

    ).groupBy(
        'color',
    ).agg(
        F.collect_list('_zoo_id').alias('_zoo_ids')
    )
)

physical_attribute_df = (
    physical_attribute_df.coalesce(1).withColumn(
        'id',
        F.row_number().over(Window.orderBy(physical_attribute_df.color))
    ).withColumn(
        'created_date', F.current_timestamp() # create_date
    ).withColumn(
        'last_updated', F.current_timestamp() # last_updated
    )
)

_physical_attribute_ids_df = physical_attribute_df.select(
    F.explode(physical_attribute_df._zoo_ids).alias('zoo_id'),
    physical_attribute_df.id.alias('physical_attribute_id'),
)


physical_attribute_df = physical_attribute_df.drop('_zoo_ids')

animal_df = (

    cat_df.join(
        _physical_attribute_ids_df,
        _physical_attribute_ids_df.zoo_id == cat_df.zoo_id,
        'left'
    ).select(

        F.current_timestamp().alias('created_date'),  # created_date
        F.current_timestamp().alias('last_updated'),  # last_updated

        cat_df.zoo_id.alias('zoo_id'),  # zoo_id

        _physical_attribute_ids_df.physical_attribute_id.alias('physical_attribute_id'),  # physical_attribute_id

    )
)


animal_df = (

    animal_df.coalesce(1).withColumn(
        'id',
        F.row_number().over(Window.orderBy(animal_df.zoo_id))
    )
)
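Why this works can be seen without Spark: `row_number().over(Window.orderBy(...))` assigns ids from an explicit sort, so the same input always yields the same ids no matter which partition order `coalesce(1)` happened to produce. A pure-Python sketch of that property:

```python
def assign_ids(colors):
    # row_number over an ORDER BY: sort first, then number from 1.
    return {color: i + 1 for i, color in enumerate(sorted(colors))}

# The arrival order differs between runs, but the assigned ids do not.
run_a = assign_ids(["black", "brown", "white", "red"])
run_b = assign_ids(["red", "white", "brown", "black"])
assert run_a == run_b
```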