
I have to build a Glue Job for updating and deleting old rows in an Athena table. When I run the job for deleting, it returns this error:

AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'

My Glue Job:

# Catalog tables: test_table holds the incoming records (with the op flag),
# test_table_output is the Hudi table on S3 that has to be updated/deleted
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")

# Register both as temp views so they can be queried with Spark SQL
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")

# Rows of test_table_output whose id appears with op = 'D' in test_table
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")

hudi_delete_options = {
  'hoodie.table.name': 'test_table_output',
  'hoodie.datasource.write.recordkey.field': 'id',
  'hoodie.datasource.write.table.name': 'test_table_output',
  'hoodie.datasource.write.operation': 'delete',
  'hoodie.datasource.write.precombine.field': 'name',
  'hoodie.upsert.shuffle.parallelism': 1, 
  'hoodie.insert.shuffle.parallelism': 1
}

from pyspark.sql.functions import lit

# Build a DataFrame of the record keys to delete; 'name' is only a dummy value
# for the precombine field
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))

# Write the delete against the Hudi dataset on S3
df.write.format("hudi"). \
  options(**hudi_delete_options). \
  mode("append"). \
  save('s3://data/test-output/')



# Read the dataset back from S3 and re-run the delete-candidate count
roAfterDeleteViewDF = spark. \
  read. \
  format("hudi"). \
  load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")

spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()

I have two data sources: the first is the old Athena table whose rows have to be updated or deleted, and the second is the table into which the new updated and deleted records arrive.

In ds I have selected all rows that have to be deleted from the old table.

op is the operation flag: 'D' for delete, 'U' for update.
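
To illustrate, here is a minimal sketch of how the op flag could be used to split the incoming records into the two paths (not part of the job above; it only assumes the temp views registered earlier):

# Illustration only -- reuses the view_dyf temp view from the job
deletes_df = spark.sql("SELECT * FROM view_dyf WHERE op = 'D'")  # rows to delete from the old table
updates_df = spark.sql("SELECT * FROM view_dyf WHERE op = 'U'")  # rows to upsert into the old table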

Does anyone know what I am missing here?

Mateja K

1 Answer


The value for hoodie.datasource.write.operation is invalid in your code; the supported write operations are UPSERT, INSERT, and BULK_INSERT. Check the Hudi documentation.

Also, what is your intention for deleting records: a hard delete or a soft delete? For a hard delete, you have to provide {'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}.
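
Putting both points together, a minimal sketch of how the options block from the question could look for a hard delete (the table and field names are taken from the question; this is illustrative, not a tested configuration):

# Sketch: hard delete via an upsert write that carries the EmptyHoodieRecordPayload
hudi_hard_delete_options = {
  'hoodie.table.name': 'test_table_output',
  'hoodie.datasource.write.recordkey.field': 'id',
  'hoodie.datasource.write.table.name': 'test_table_output',
  'hoodie.datasource.write.operation': 'upsert',
  'hoodie.datasource.write.precombine.field': 'name',
  # hard delete: matched record keys are rewritten with an empty payload
  'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload',
  'hoodie.upsert.shuffle.parallelism': 1,
  'hoodie.insert.shuffle.parallelism': 1
}

df.write.format("hudi"). \
  options(**hudi_hard_delete_options). \
  mode("append"). \
  save('s3://data/test-output/')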

mtami
  • I have resolved this problem. The code was fine; the problem was in the DDL of the Athena Hudi table. – Mateja K Jul 09 '21 at 08:25