Is there a built-in way to define a DataFrame as a set of partition paths (each with one or more files), use that DataFrame as the basis of a set of so-called "mutation" queries which are defined as a separate DataFrame, and partition the resulting DataFrame by the same columns, only writing the "changed" partitions (i.e. only those partitions where the data is different between the original DataFrame and the resulting DataFrame)?
This would be for "dimension" data, of course.
The whole point is to reduce total storage by re-using partition files that haven't changed. Perhaps storage is cheap enough that this is pointless. Nevertheless, it would be good to know.
Obviously there are ways to do this by specifying a series of transforms myself:
{original DataFrame} -> o
{resulting DataFrame} -> r
{*, COUNT(o.*) GROUP BY *} -> o2
{*, COUNT(r.*) GROUP BY *} -> r2
{DISTINCT [partition columns] FROM o2 FULL JOIN r2 ON [partition columns] WHERE o2.* IS NULL OR r2.* IS NULL}
I can't tell if there would be any shuffling in the general case. Given that the JOIN is on partition columns and that's how the original and resulting DataFrames are both partitioned, perhaps there would be none.
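To make the intent of the transforms concrete, here is a minimal sketch in plain Python rather than any Spark API. It assumes rows are dicts with hashable values and that a partition counts as "changed" whenever any full row has a different count in the two data sets (which also catches added, removed, and duplicated rows). The function name changed_partitions and the sample column names are made up for illustration.

```python
from collections import Counter


def changed_partitions(original, resulting, partition_cols):
    """Return the set of partition keys whose row multisets differ.

    Rows are plain dicts; this only illustrates the comparison logic
    and says nothing about how an engine would execute it.
    """
    def counted(rows):
        # Analogue of `*, COUNT(*) GROUP BY *`: count each distinct row.
        return Counter(tuple(sorted(row.items())) for row in rows)

    o2, r2 = counted(original), counted(resulting)

    changed = set()
    # Analogue of a full outer join on the whole row: a row missing on one
    # side has count 0 there, so any count mismatch marks its partition.
    for row in o2.keys() | r2.keys():
        if o2[row] != r2[row]:
            as_dict = dict(row)
            changed.add(tuple(as_dict[c] for c in partition_cols))
    return changed


# Hypothetical sample data, partitioned by "country".
original = [
    {"country": "US", "id": 1, "name": "a"},
    {"country": "US", "id": 2, "name": "b"},
    {"country": "FR", "id": 3, "name": "c"},
]
resulting = [
    {"country": "US", "id": 1, "name": "a"},
    {"country": "US", "id": 2, "name": "b"},
    {"country": "FR", "id": 3, "name": "C"},   # modified row
    {"country": "DE", "id": 4, "name": "d"},   # new partition
]

print(sorted(changed_partitions(original, resulting, ["country"])))
# [('DE',), ('FR',)]
```

Only the FR and DE partitions would need to be rewritten here; the US partition files could be reused as they are.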