
I am trying to set up a demo Glue job that demonstrates upserts using the Delta Lake framework.

I have example full-load data saved as a Delta table in an S3 bucket, defined as follows:

data = {'visitor': ['foo', 'bar', 'baz'],
        'id': [1, 2, 3],
        'B': [1, 0, 1],
        'C': [1, 0, 0]}

and example incremental data with the same schema but under a different S3 prefix, defined as follows:

data_updated = {'visitor': ['foo_updated'],
                'id': [1],
                'B': [1],
                'C': [1]}

After executing the following statements:

from delta.tables import DeltaTable
from pyspark.sql.functions import col, expr

delta_df = DeltaTable.forPath(spark, "s3://example_bucket/full_load")
cdc_df = spark.read.format("delta").load("s3://example_bucket/incremental_load/")

final_df = delta_df.alias("prev_df").merge(
    source=cdc_df.alias("append_df"),
    # matching on the primary key
    condition=expr("prev_df.id = append_df.id")) \
    .whenMatchedUpdate(set={
        "prev_df.B": col("append_df.B"),
        "prev_df.C": col("append_df.C"),
        "prev_df.visitor": col("append_df.visitor")}) \
    .whenNotMatchedInsert(values={
        # inserting a new row into the Delta table
        "prev_df.B": col("append_df.B"),
        "prev_df.C": col("append_df.C"),
        "prev_df.visitor": col("append_df.visitor"),
    }) \
    .execute()

The tables in S3 were created as follows:

    import pandas as pd

    df = pd.DataFrame(data)
    dataFrame = spark.createDataFrame(df)
    dataFrame.write \
        .format("delta") \
        .mode("overwrite") \
        .save("s3://example_bucket/full_load")

    df = pd.DataFrame(data_updated)
    dataFrame = spark.createDataFrame(df)
    dataFrame.write \
        .format("delta") \
        .mode("overwrite") \
        .save("s3://example_bucket/incremental_load")

The incremental row with id 1 gets appended to the full-load table instead of the original row getting updated. What am I doing wrong?

huehue
    Can we cross-check sample data in both tables? `delta_df.toDF().show()` and `cdc_df.show()` – lsc May 11 '23 at 14:10
  • @lsc `delta_df.toDF().show()` ->
    ```
    +-------+---+---+---+
    |visitor| id|  B|  C|
    +-------+---+---+---+
    |    baz|  3|  1|  0|
    |    bar|  2|  0|  0|
    |    foo|  1|  1|  1|
    +-------+---+---+---+
    ```
    `cdc_df.show()` ->
    ```
    +-----------+---+---+---+
    |    visitor| id|  B|  C|
    +-----------+---+---+---+
    |foo_updated|  1|  1|  1|
    +-----------+---+---+---+
    ```
    – huehue May 11 '23 at 14:33

1 Answer


The overall flow should be:

  1. Enable Delta Lake for AWS Glue (see the sketch after this list).
  2. Create the two Delta tables in the Glue Data Catalog using the DataFrameWriter API.
  3. Do not run any Glue crawler or other API that updates the Glue Data Catalog.
  4. Merge.
  5. Query in Athena.
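
For step 1, Delta Lake support has to be switched on before the job runs. On Glue 3.0/4.0 this is done with the `--datalake-formats` job parameter set to `delta`; the Spark session then needs the Delta extensions. A minimal sketch (the same settings can alternatively be passed via the `--conf` job parameter):

# Minimal sketch: enabling Delta Lake in a Glue job script.
# Assumes the job was created with the job parameter --datalake-formats = delta
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()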

Create and write a Delta table:

dataFrame.write \
    .format("delta") \
    .mode("overwrite") \
    .option("path", s3_path) \
    .saveAsTable(f"{database_name}.{table_name}")
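
`saveAsTable` registers the table in the Glue Data Catalog, so the target database must exist first. A minimal sketch, assuming `database_name` is the same placeholder as above:

# Create the Glue catalog database up front if it doesn't exist yet
spark.sql(f"CREATE DATABASE IF NOT EXISTS {database_name}")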

Merge:

delta_df.alias("prev_df").merge(
    cdc_df.alias("append_df"),
    # matching on the primary key
    "prev_df.id = append_df.id") \
    .whenMatchedUpdate(set={
        "B": "append_df.B",
        "C": "append_df.C",
        "visitor": "append_df.visitor"
    }) \
    .whenNotMatchedInsert(values={
        # inserting a new row into the Delta table
        "id": "append_df.id",
        "B": "append_df.B",
        "C": "append_df.C",
        "visitor": "append_df.visitor"
    }) \
    .execute()
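
After the merge, the result can be sanity-checked directly from Spark before querying Athena. A minimal sketch using the same path as in the question:

# Read the merged target table back; id 1 should now show "foo_updated"
from delta.tables import DeltaTable

merged_df = DeltaTable.forPath(spark, "s3://example_bucket/full_load").toDF()
merged_df.orderBy("id").show()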
lsc
  • Still the same. I get this at the full load location when querying with Athena:
    ```
    #  id  visitor      b  c
    1  3   baz          1  0
    4  2   bar          0  0
    2  1   foo_updated  1  1
    3  1   foo          1  1
    ```
    – huehue May 11 '23 at 14:46
  • Let's try to isolate the issue one at a time. After the merge, what does `delta_df.toDF().show()` display? – lsc May 11 '23 at 15:19
  • The output is as expected actually! It displays the merged row.
    ```
    +-----------+---+---+---+
    |    visitor| id|  B|  C|
    +-----------+---+---+---+
    |foo_updated|  1|  1|  1|
    |        bar|  2|  0|  0|
    |        baz|  3|  1|  0|
    +-----------+---+---+---+
    ```
    – huehue May 11 '23 at 16:27
  • 1
    @huehue So the issue should be how you create the database and/or table in Glue in the first place. Can you describe that in your original question? – lsc May 11 '23 at 16:43
  • Great. Thanks. Extended the question. In the real-world example, the sample dataframes would be replaced with data from an actual data source. – huehue May 11 '23 at 17:00
  • @huehue edited the answer to include overall flow and delta table creation. – lsc May 12 '23 at 10:48
  • Do you perhaps know why the created table in Glue catalog contains ```__PLACEHOLDER__``` as the suffix for the table location? – huehue May 16 '23 at 14:50
  • I cannot query it directly in Athena as I get the following exception: ```COLUMN_NOT_FOUND: line 1:8: SELECT * not allowed from relation that has no columns``` – huehue May 16 '23 at 15:03
  • No idea. Does this SO apply to you? https://stackoverflow.com/questions/69474624/why-is-my-glue-table-creating-with-the-wrong-path – lsc May 16 '23 at 15:30
  • It's the same issue, but I have only those few simple columns in the sample data, and I have already tried purging the S3 location, deleting the tables from the catalog, and rerunning the script, but to no avail – huehue May 16 '23 at 16:08