I am open to other ways of doing this. Here are my constraints:
- I have parquet files in a container in Azure Blob Storage
- These parquet files will be partitioned by a product id, as well as the date (year/month/day); a sketch of the layout is just below this list
- I am doing this in R, and want to be able to connect interactively (not just set up a notebook in databricks, though that is something I will probably want to figure out later)
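For reference, the directory layout inside the container will look roughly like this (placeholder names, one level of directories per partition key; the real file names will differ):

```
{CONTAINER-NAME}/
  {product-id}/
    {year}/
      {month}/
        {day}/
          part-0.parquet
```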
Here's what I am able to do:
- I understand how to use `arrow::open_dataset()` to connect to a local parquet directory:

  ```r
  ds <- arrow::open_dataset(filepath, partitioning = "product")
  ```
- I can connect to, view, and download from my blob container with the `AzureStor` package. I can download a single parquet file this way and turn it into a data frame (a directory-level fallback using the same connection is sketched after this list):

  ```r
  blob <- AzureStor::storage_endpoint("{URL}", key = "{KEY}")
  cont <- AzureStor::storage_container(blob, "{CONTAINER-NAME}")
  parq <- AzureStor::storage_download(cont, src = "{FILE-PATH}", dest = NULL)
  df   <- arrow::read_parquet(parq)
  ```
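Since I said I'm open to other approaches: the only workaround I can imagine so far is to mirror the whole directory locally and open the copy, roughly as below. That works in principle but copies everything up front, which defeats the point of a lazily evaluated dataset. (The wildcard/recursive arguments to `storage_multidownload()` are my reading of the AzureStor docs, and `{PARENT-PATH}` is a placeholder for the directory that holds the partitioned parquet files.)

```r
# Brute-force fallback: mirror the remote parquet directory into a temp
# folder, then point arrow at the local copy.
# "cont" is the storage_container() handle created above.
local_dir <- tempfile("parquet_cache_")
dir.create(local_dir)

# NOTE: the wildcard + recursive combination is my assumption from the
# AzureStor documentation.
AzureStor::storage_multidownload(
  cont,
  src       = "{PARENT-PATH}/*",
  dest      = local_dir,
  recursive = TRUE
)

# Open the downloaded copy as a partitioned dataset.
ds <- arrow::open_dataset(
  local_dir,
  partitioning = c("product", "year", "month", "day")
)
```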
What I haven't been able to figure out is how to use `arrow::open_dataset()` to reference the parent directory of `{FILE-PATH}`, where all of the parquet files live, using the connection to the container that I'm creating with `AzureStor`. `arrow::open_dataset()` only accepts a character vector as its `sources` parameter, and if I just give it the URL with the path, I'm not passing any kind of credential to access the container.
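To make that concrete, this is roughly the shape of call I'm after, but there's nowhere to supply the account key or a SAS (everything in braces is a placeholder):

```r
# What I'd like to do, in spirit: open the remote directory as a dataset.
# Passing the plain blob URL gives arrow no credentials, so this doesn't
# get me anywhere with a private container.
ds <- arrow::open_dataset(
  "https://{ACCOUNT}.blob.core.windows.net/{CONTAINER-NAME}/{PARENT-PATH}",
  partitioning = c("product", "year", "month", "day")
)
```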