I am doing a Delta Lake merge operation using the Python API and PySpark. After the merge I run a compaction step, but the compaction fails with the following error:
Error:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 170, in load
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1212, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1199, in _get_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_collections.py", line 501, in convert
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in <listcomp>
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Code:
from delta.tables import DeltaTable

delta_table = "delta_lake_path"
df = spark.read.csv("s3n://input_file.csv", header=True)

# merge the input CSV into the Delta table
delta_table = DeltaTable.forPath(spark, delta_table)  # note: reuses the variable that held the path
delta_table.alias("delta_table").merge(df.alias("df"), "df.id = delta_table.id").whenNotMatchedInsertAll().execute()

# compaction
spark.read.format("delta").load(delta_table).repartition(1) \
    .write.option("dataChange", "false").format("delta").mode("overwrite").save(delta_table)
Can anyone suggest why the Spark session is not able to create another Delta table instance? I need to perform both the merge and the compaction in the same script because I want to run the compaction only on the partitions that the merge touched. Those partitions are derived from the unique values in the DataFrame df created from input_file.csv.
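For reference, this is roughly the per-partition compaction I am aiming for, written as a sketch rather than working code: I am assuming the table path has to stay in its own string variable (delta_path here) so it can be passed to load() and save(), and the partition column name partition_col and the string-typed partition values are placeholders for my actual schema.

delta_path = "delta_lake_path"

# partitions touched by the merge = distinct values present in the input DataFrame
merged_values = [row["partition_col"] for row in df.select("partition_col").distinct().collect()]

# rewrite only those partitions; dataChange=false marks the commit as a pure rearrangement
for value in merged_values:
    (spark.read.format("delta").load(delta_path)
        .where(f"partition_col = '{value}'")
        .repartition(1)
        .write.format("delta")
        .mode("overwrite")
        .option("dataChange", "false")
        .option("replaceWhere", f"partition_col = '{value}'")
        .save(delta_path))

Is restricting the overwrite with replaceWhere like this a reasonable way to compact only the merged partitions, or is there a better approach?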