Consider the example below:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

_schema = ['num_col', 'word']
_data = [
    (1, 'idA'), (2, 'idA'), (3, 'idB'), (4, 'idC'), (5, 'idC'),
    (1, 'idC'), (2, 'idC'),
]
df = spark.createDataFrame(_data, _schema)

out_path = '/tmp/output.delta'
_ = (
    df
    .coalesce(1)                # force a single writer task
    .write
    .partitionBy('num_col')     # one output directory per num_col value
    .format('delta')
    .save(out_path)
)
# inspect the Delta transaction log: each 'add' action records one written file
log_df = spark.read.json(f'{out_path}/_delta_log/*.json')
cols = [
    F.col('add.path').substr(0, 22).alias('file_name'),
    F.expr('TIMESTAMP_MILLIS(add.modificationTime)').alias('modified'),
]
log_df.where('add IS NOT NULL').select(cols).show(10, False)
+----------------------+-----------------------+
|file_name |modified |
+----------------------+-----------------------+
|num_col=1/part-00000-5|2023-07-29 10:36:13.622|
|num_col=2/part-00000-b|2023-07-29 10:36:13.63 |
|num_col=3/part-00000-8|2023-07-29 10:36:13.638|
|num_col=4/part-00000-0|2023-07-29 10:36:13.642|
|num_col=5/part-00000-9|2023-07-29 10:36:13.658|
+----------------------+-----------------------+
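Rather than eyeballing the table, the ordering in this particular run can be checked mechanically. A minimal pure-Python sketch, using the timestamp values copied from the output above (the second row's `.63` is `.630` with the trailing zero trimmed by `show`):

```python
from datetime import datetime

# (partition value, modificationTime) pairs copied from the table above
observed = [
    (1, '2023-07-29 10:36:13.622'),
    (2, '2023-07-29 10:36:13.630'),
    (3, '2023-07-29 10:36:13.638'),
    (4, '2023-07-29 10:36:13.642'),
    (5, '2023-07-29 10:36:13.658'),
]

times = [datetime.strptime(t, '%Y-%m-%d %H:%M:%S.%f') for _, t in observed]

# True when the modification times strictly increase with the partition value
print(all(a < b for a, b in zip(times, times[1:])))  # → True
```

This only confirms the observation for one run; it says nothing about whether the ordering is guaranteed.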
I ran this multiple times with different numbers of partitions, and the modification time always increases with the partition value. So my assumption is: using coalesce(1) together with partitionBy('columnX') writes the partition files sequentially, one after another, in increasing order of columnX. Is that assumption correct?