
I have a Delta table that was created like this:

# Load the data from its source.
df = spark.read.load("/databricks-datasets/learning-spark-v2/people/people-10m.delta")

# Write the data to a table.
table_name = "people_10m"
df.write.saveAsTable(table_name)

I now have a schema change to apply: maybe a single new column, maybe a few columns, maybe nested arrays. I can't predict ahead of time what will come up during code execution.

I used Python's set API to find the new columns, and now I want to add them to the Delta table, ideally using the Python API.

One thought was to modify the schema of the DataFrame and then somehow tell the table to match it. I don't want to read and rewrite the whole dataset, and I don't want to lose the table history either. I would be OK with schema evolution if it can be done as a schema-only update (no data written) while blocking any column deletions.
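A minimal sketch of that set comparison, assuming the incoming data is in a DataFrame called new_df (a placeholder name):

# Compare the incoming DataFrame's columns against the existing table's columns.
existing_cols = set(spark.table(table_name).columns)
incoming_cols = set(new_df.columns)

# Columns present in the incoming data but missing from the table.
added_cols = incoming_cols - existing_cols

This only compares top-level column names; nested fields would need a recursive walk over the StructType.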

Brian
  • Have you tried the ALTER statement? You can execute this SQL from Python using spark.sql("ALTER ..."); see the sketch after these comments. https://docs.delta.io/latest/delta-batch.html#add-columns – Nick Karpov Oct 24 '22 at 16:43
  • That's SQL, not Python. What I got working was to append an empty DataFrame with the new schema, with schema evolution enabled. – Brian Oct 25 '22 at 04:46
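A rough sketch of the ALTER-based approach from the comment above, executed through spark.sql; the column names and types are placeholders, and table_name is the people_10m table created earlier:

# Hypothetical new columns discovered by the set comparison; names/types are placeholders.
new_columns = {"middle_name_2": "STRING", "nicknames": "ARRAY<STRING>"}

# Build the column list and add the columns to the table.
cols_ddl = ", ".join(f"{name} {dtype}" for name, dtype in new_columns.items())
spark.sql(f"ALTER TABLE {table_name} ADD COLUMNS ({cols_ddl})")

ADD COLUMNS only touches the table metadata, so no data is rewritten and the history is preserved.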

1 Answer


The solution that worked was to create an empty DataFrame with the new schema (no removed columns, only additions), then append it to the table with schema evolution enabled.
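Roughly, it looks like this; a sketch only, where the new fields are placeholders and the table is the people_10m table from the question:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Hypothetical additions; in practice these come from the set comparison.
new_fields = [
    StructField("middle_name_2", StringType(), True),
    StructField("nicknames", ArrayType(StringType()), True),
]

# The table's current schema plus the new fields (additions only, no removals).
evolved_schema = StructType(spark.table(table_name).schema.fields + new_fields)

# Append a zero-row DataFrame with mergeSchema enabled: the new columns are
# merged into the table's schema, no existing data is rewritten, and the
# table history is kept (the change shows up as a new commit).
empty_df = spark.createDataFrame([], evolved_schema)
(empty_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable(table_name))

Because mergeSchema only ever adds columns (it never drops them), this also satisfies the "no column deletions" requirement.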

Brian