Is there an easy way of identifying the variable that was used to partition a parquet dataset?
As an example, below I create a toy parquet dataset from mtcars.
# Load library
library(arrow)
# Write data to parquet
mtcars |> write_dataset("~/boop", partitioning = "cyl")
One approach to determining the partitioning variable(s) could be to list the files that make up the dataset, like so:
# Open dataset & see files that are part of parquet
open_dataset("~/boop")$files
# [1] "XXXXX/boop/cyl=4/part-0.parquet" "XXXXX/boop/cyl=6/part-0.parquet"
# [3] "XXXXX/boop/cyl=8/part-0.parquet"
Here, I can see that cyl is the partitioning variable, but I would need to parse it out of the file paths, and with several partitioning variables that could get a smidge involved.
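For reference, here is a minimal sketch of the parsing I have in mind. It assumes hive-style `key=value` directory names (as in the output above) and uses only base R. `partition_vars` is a hypothetical helper name of my own, not part of arrow, and the example paths are made up for illustration.

```r
# Hypothetical helper (not part of arrow): pull hive-style partition keys
# out of a character vector of dataset file paths.
partition_vars <- function(files) {
  # Drop the file name so only directory components are inspected
  dirs <- dirname(files)
  # Split each directory path into its components
  parts <- unlist(strsplit(dirs, "/", fixed = TRUE))
  # Keep components that look like key=value
  parts <- grep("^[^=]+=", parts, value = TRUE)
  # Strip the "=value" part and de-duplicate, preserving first-seen order
  unique(sub("=.*$", "", parts))
}

# Made-up paths with two partitioning levels
files <- c(
  "boop/cyl=4/gear=3/part-0.parquet",
  "boop/cyl=6/gear=4/part-0.parquet"
)
partition_vars(files)  # character vector with "cyl" and "gear"
```

With the dataset above this would be called as `partition_vars(open_dataset("~/boop")$files)`. It would misfire if a parent directory outside the dataset happened to contain an `=`, and it would find nothing for datasets written without hive-style directory names, which is part of why I would prefer something built in.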
Is there a simple way of determining the partitioning variable(s)? For example, is there a metadata field that records this information?