Consider the example below:

from pyspark.sql import functions as F

_schema = ['num_col', 'word']
_data = [
    (1, 'idA'), (2, 'idA'), (3, 'idB'), (4, 'idC'), (5, 'idC'),
    (1, 'idC'), (2, 'idC'),
]
df = spark.createDataFrame(_data, _schema)

out_path = '/tmp/output.delta'
_ = (
    df
    .coalesce(1)
    .write
    .partitionBy('num_col')
    .format('delta')
    .save(out_path)
)
log_df = spark.read.json(f'{out_path}/_delta_log/*.json')
cols = [
    F.col('add.path').substr(0, 22).alias('file_name'),
    F.expr('TIMESTAMP_MILLIS(add.modificationTime)').alias('modified'),
]
log_df.where('add IS NOT NULL').select(cols).show(10, False)

+----------------------+-----------------------+
|file_name             |modified               |
+----------------------+-----------------------+
|num_col=1/part-00000-5|2023-07-29 10:36:13.622|
|num_col=2/part-00000-b|2023-07-29 10:36:13.63 |
|num_col=3/part-00000-8|2023-07-29 10:36:13.638|
|num_col=4/part-00000-0|2023-07-29 10:36:13.642|
|num_col=5/part-00000-9|2023-07-29 10:36:13.658|
+----------------------+-----------------------+

I ran this multiple times with different numbers of partitions, and the modification time always increased with the partition values. My assumption is therefore: using coalesce(1) with partitionBy('columnX') writes the files sequentially, in increasing order of columnX. Is that assumption correct?
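One way to check this beyond eyeballing the table is to parse the partition value out of each add.path and verify that modificationTime is non-decreasing in that order. The helper below is plain Python (no Spark required), fed with the add entries from the output above; the sub-second parts of the timestamps are used as stand-in values for the full epoch millis, and the function name is my own.

```python
import re

def modified_in_partition_order(add_entries):
    """Return True if modificationTime is non-decreasing when the
    files are sorted by the partition value in their path."""
    def partition_value(entry):
        path, _ = entry
        # Paths look like 'num_col=<value>/part-...', as in the log output
        return int(re.match(r'num_col=(\d+)/', path).group(1))

    ordered = sorted(add_entries, key=partition_value)
    times = [t for _, t in ordered]
    return all(a <= b for a, b in zip(times, times[1:]))

# (path, modificationTime) pairs taken from the _delta_log output above
entries = [
    ('num_col=1/part-00000-5', 622),
    ('num_col=2/part-00000-b', 630),
    ('num_col=3/part-00000-8', 638),
    ('num_col=4/part-00000-0', 642),
    ('num_col=5/part-00000-9', 658),
]
print(modified_in_partition_order(entries))  # → True
```

Note that a single run returning True only shows the ordering held that time; it does not by itself prove the writer guarantees it.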

boyangeor

0 Answers