Say I have a Delta table called data that holds some time-series data. It's stored like this:
/data
  /date=2022-11-30
    /region=usa
      part-000001.parquet
      part-000002.parquet
Here I have two partition keys (date and region) and two parquet files in the partition. I can easily list the files for a given partition with:
dbutils.fs.ls('/data/date=2022-11-30/region=usa')
But if I now make an update to the table, it regenerates the parquet files, and I end up with 4 files in that directory: the two old ones and the two new ones.
How can I retrieve the latest version of the parquet files? Do I really have to loop through all the _delta_log state files and rebuild the state? Or do I have to run VACUUM to clean up the old versions so I can get the most recent files?
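To be concrete about what I mean by rebuilding the state, here is a rough sketch of the manual approach I'd like to avoid. It assumes the log consists only of JSON commit files (it ignores parquet checkpoints, so it would be wrong for a table whose early commits have been cleaned up), that the table path is readable through the local file API, and that the paths in the log are relative to the table root. The function name active_files is just something I made up for illustration.

```python
import json
import re
from pathlib import Path

# Commit files are named with a 20-digit, zero-padded version number.
COMMIT_NAME = re.compile(r"\d{20}\.json")


def active_files(table_path):
    """Replay the JSON commits in _delta_log and return the live data files.

    Hypothetical helper: ignores checkpoints and does not URL-decode paths.
    """
    log_dir = Path(table_path) / "_delta_log"
    commits = sorted(
        p for p in log_dir.iterdir() if COMMIT_NAME.fullmatch(p.name)
    )
    live = set()
    # Sorting by name replays the commits in version order.
    for commit in commits:
        with commit.open() as fh:
            for line in fh:
                if not line.strip():
                    continue
                action = json.loads(line)  # one action per line
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return sorted(live)
```

Calling active_files on the table root would then give the relative paths of the files that make up the current version, which I could filter by the date=.../region=... prefix.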
There has to be a magic function.