Identify partitioning variable in parquet file

Question

Is there an easy way of identifying the variable that was used to partition a parquet dataset?

As an example, below I create a toy parquet dataset from mtcars, partitioned by cyl.

# Load library
library(arrow)

# Write data to parquet
mtcars |> write_dataset("~/boop", partitioning = "cyl")

One approach to determining the partitioning variable(s) could be to view the files that the parquet is composed of, like so:

# Open dataset & see files that are part of parquet
open_dataset("~/boop")$files

# [1] "XXXXX/boop/cyl=4/part-0.parquet" "XXXXX/boop/cyl=6/part-0.parquet"
# [3] "XXXXX/boop/cyl=8/part-0.parquet"

Here, I can see that cyl is the partitioning variable, but I would need to parse it out of the paths, and with several partitioning variables that might get a smidge involved.

Is there a simple way of determining the partitioning variable? For example, is there a metadata variable that records this information?

score 0 · Accepted Answer · answered Jan 11 '23 at 15:09

Until someone suggests a better solution, this seems to work:

# Load library
library(arrow)

# Write data to parquet
mtcars |> write_dataset("~/boop", partitioning = c("cyl", "gear"))

# Files in parquet
pq_files <- open_dataset("~/boop")$files

# Extract partition names assuming */partition_name=value/* format
regmatches(pq_files, gregexpr("(?<=/)[^/]*(?==)", pq_files, perl = TRUE)) |> unlist() |> unique()
# [1] "cyl"  "gear"

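As suggested in the question, I list the files in the dataset and then use a regex to pull out the text sandwiched between / and =, which should correspond to the partition names. This only works when the partition directories are named partition_name=value, as they are in the paths shown in the question.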
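
If the one-line regex feels hard to read, the same idea can be sketched as a small base R helper. The name partition_vars is hypothetical (it is not part of arrow), and the example paths are made up to mirror the layout of open_dataset("~/boop")$files. It makes the same assumption as above: partition directories are named partition_name=value. Any other directory in the path that happens to contain = would be picked up as well.

```r
# Hypothetical helper (not part of arrow): extract partition names
# from file paths, assuming */partition_name=value/* directories.
partition_vars <- function(files) {
  # Drop the file name, then split each path into directory components
  dirs <- strsplit(dirname(files), "/", fixed = TRUE)
  keys <- lapply(dirs, function(parts) {
    # Keep only components that look like name=value
    parts <- parts[grepl("^[^=]+=", parts)]
    # Strip everything from the first "=" onwards, leaving the name
    sub("=.*$", "", parts)
  })
  # Unique names, in the order they appear in the paths
  unique(unlist(keys))
}

# Made-up paths in the same layout as open_dataset("~/boop")$files
files <- c(
  "XXXXX/boop/cyl=4/gear=3/part-0.parquet",
  "XXXXX/boop/cyl=6/gear=4/part-0.parquet",
  "XXXXX/boop/cyl=8/gear=5/part-0.parquet"
)

partition_vars(files)
# [1] "cyl"  "gear"
```

With a real dataset this would be called as partition_vars(open_dataset("~/boop")$files). One small difference from the regex above: the helper cuts at the first = in each directory name, so a partition value that itself contains = does not bleed into the name.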
