
Is there a built-in way to define a DataFrame as a set of partition paths (each with one or more files), use that DataFrame as the basis of a set of so-called "mutation" queries which are defined as a separate DataFrame, and partition the resulting DataFrame by the same columns, only writing the "changed" partitions (i.e. only those partitions where the data is different between the original DataFrame and the resulting DataFrame)?

This would be for "dimension" data, of course.

The whole point is to reduce total storage by re-using partition files that haven't changed. Perhaps storage is cheap enough that this is pointless. Nevertheless, it would be good to know.

Obviously there are ways to do this by specifying a series of transforms myself:

  1. {original DataFrame} -> o
  2. {resulting DataFrame} -> r
  3. {*, COUNT(o.*) GROUP BY *} -> o2
  4. {*, COUNT(r.*) GROUP BY *} -> r2
  5. {DISTINCT [partition columns] FROM o2 FULL JOIN r2 ON [partition columns] WHERE o2.* IS NULL OR r2.* IS NULL}
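The comparison in steps 3–5 can be illustrated outside Spark. Below is a minimal sketch in plain Python (not Spark code), modeling each DataFrame as a hypothetical mapping from partition key to a multiset of rows; the function name and data shapes are made up for illustration:

```python
from collections import Counter

def changed_partitions(original, resulting):
    """Return the partition keys whose contents differ between two sides.

    Each side is modeled as a dict: partition key -> Counter of rows.
    This mirrors steps 3-5 above: the per-row COUNT ... GROUP BY * is the
    Counter, and a partition is "changed" when its (row, count) multisets
    disagree, which is what the FULL JOIN null-check detects.
    """
    changed = set()
    for key in set(original) | set(resulting):
        if original.get(key, Counter()) != resulting.get(key, Counter()):
            changed.add(key)
    return changed

# Example: "2018-01" is untouched, "2018-02" has a mutated row,
# and "2018-03" exists only in the resulting side.
o = {
    "2018-01": Counter([("a", 1), ("b", 2)]),
    "2018-02": Counter([("c", 3)]),
}
r = {
    "2018-01": Counter([("a", 1), ("b", 2)]),
    "2018-02": Counter([("c", 4)]),   # value changed
    "2018-03": Counter([("d", 5)]),   # new partition
}
print(sorted(changed_partitions(o, r)))  # ['2018-02', '2018-03']
```

Only the partitions returned here would need to be rewritten; the rest could keep their existing files.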

I can't tell if there would be any shuffling in the general case. Given that the JOIN is on partition columns and that's how the original and resulting DataFrames are both partitioned, perhaps there would be none.

  • [Slowly changing dimension](https://en.wikipedia.org/wiki/Slowly_changing_dimension) can be implemented in Spark, but if you're interested in mutation, just go with proper database. Overwriting storage as a part of the logic is a recipe for disaster. – zero323 Jan 22 '18 at 18:57
  • This isn't overwriting storage... I still have the original `DataFrame` partition files. I just have a new `DataFrame` that shares storage with the original `DataFrame`, so I cut down on disk space utilization. – jennykwan Jan 22 '18 at 19:44

0 Answers