
I am open to other ways of doing this. Here are my constraints:

  • I have parquet files in a container in Azure Blob Storage
  • These parquet files will be partitioned by a product id, as well as the date (year/month/day)
  • I am doing this in R, and want to be able to connect interactively (not just set up a notebook in Databricks, though that is something I will probably want to figure out later)

Here's what I am able to do:

  • I understand how to use arrow::open_dataset() to connect to a local parquet directory: ds <- arrow::open_dataset(filepath, partitioning = "product")
  • I can connect to, view, and download from my blob container with the AzureStor package. I can download a single parquet file this way and turn it into a data frame:
blob <- AzureStor::storage_endpoint("{URL}", key = "{KEY}")
cont <- AzureStor::storage_container(blob, "{CONTAINER-NAME}")
# dest = NULL returns the downloaded file as a raw vector instead of writing it to disk
parq <- AzureStor::storage_download(cont, src = "{FILE-PATH}", dest = NULL)
df <- arrow::read_parquet(parq)

What I haven't been able to figure out is how to use arrow::open_dataset() to reference the parent directory of {FILE-PATH}, where I have all the parquet files, using the connection to the container that I'm creating with AzureStor. arrow::open_dataset() only accepts a character vector as the "sources" parameter. If I just give it the URL with the path, I'm not passing any kind of credential to access the container.

Chris Umphlett
  • Have you looked at using Azure Synapse? – Dean MacGregor Mar 09 '22 at 21:14
  • No I haven’t. It’s been recommended to us for other reasons. I’m a noob at all this cloud BI stuff; there’s a lot of overhead and IT assistance involved in just trying something, which is hard to justify if I don’t know that it helps with a specific use case. – Chris Umphlett Mar 09 '22 at 22:06
  • Ultimately we ended up doing this in a databricks notebook with pyspark, which was a lot faster/more efficient use of memory. – Chris Umphlett Jun 09 '22 at 17:13

2 Answers


Unfortunately, you probably are not going to be able to do this today purely from R.

The arrow R package is built on Arrow C++, and Arrow C++ does not yet have a filesystem implementation for Azure. There are JIRA tickets (ARROW-9611, ARROW-2034) for creating one, but they are not in progress at the moment.

In Python it is possible to create a filesystem purely in Python using the fsspec adapter. Since there is a Python SDK for Azure Blob Storage, it should be possible to do what you want today in Python.

Presumably something similar could be created for R, but you would still need to create the R equivalent of the fsspec adapter, and that would involve some C++ code.
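
In the meantime, one workaround (which the comments below also touch on) is to stage the files locally with AzureStor and then point arrow::open_dataset() at the local copy. This is only a rough sketch under the layout described in the question: {PARENT-PATH} is a placeholder for the directory that holds the parquet files, and storage_multidownload()'s wildcard and recursive handling may differ across AzureStor versions.

# same connection as in the question
blob <- AzureStor::storage_endpoint("{URL}", key = "{KEY}")
cont <- AzureStor::storage_container(blob, "{CONTAINER-NAME}")

# stage the partitioned files locally; narrow the wildcard to just the
# products/dates you need so the download stays manageable
local_dir <- file.path(tempdir(), "parquet_staging")
AzureStor::storage_multidownload(cont, src = "{PARENT-PATH}/*", dest = local_dir, recursive = TRUE)

# open the local copy like any other partitioned parquet directory
ds <- arrow::open_dataset(local_dir, partitioning = c("product", "year", "month", "day"))

The download obviously costs time and disk space, so it only makes sense for a subset of the data.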

Pace
  • That's a bummer. We'll try with Python; I could put that in a notebook before switching to R. We can also modify the Blob storage constraint... some subset of the parquet directory could be moved to a local/network drive first, but that could get expensive, both $ and time-wise – Chris Umphlett Dec 10 '21 at 22:39
  • There is a brief blurb on pyarrow/adlfs here: https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems-with-arrow but I haven't personally tried it. – Pace Dec 10 '21 at 23:12

If you use Azure Synapse then you can connect to your data with ODBC as if it were a SQL Server database, and it has support for partitioning and other file types as well. The pricing, from what I recall, is something like $5/month fixed plus $5/TB queried.

Querying data would look something like this...

library(DBI)
library(odbc)

# connect to the Synapse serverless SQL endpoint over ODBC
syncon <- dbConnect(odbc(),
                    Driver = "SQL Server Native Client 11.0",
                    Server = "yourname-ondemand.sql.azuresynapse.net",
                    Database = "dbname",
                    UID = "sqladminuser",
                    PWD = rstudioapi::askForPassword("Database password"),
                    Port = 1433)

somedata <- dbGetQuery(syncon, r"---{
    SELECT TOP 100
        result.filepath(1) AS year,
        result.filepath(2) AS month,
        *
    FROM OPENROWSET(
        BULK 'blobcontainer/directory/*/*/*.parquet',
        DATA_SOURCE = 'blobname',
        FORMAT = 'parquet'
    ) AS [result]
    ORDER BY node, pricedate, hour
}---")

The filepath() function refers to the wildcards in the BULK path: result.filepath(1) returns the part of the path matched by the first *, filepath(2) the second, and so on.

Here's the MS website https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-specific-files

You can also make views so that people who like SQL but not parquet files can query the views without having to know anything about the underlying data structure; to them it'll just look like a SQL Server database.
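
For example, a view over the same OPENROWSET query can be created through the odbc connection above. This is just a sketch following the pattern of the query earlier in this answer; the view name (dbo.parquet_vw) is illustrative.

# create a view over the parquet files (run once)
DBI::dbExecute(syncon, r"---{
CREATE VIEW dbo.parquet_vw AS
SELECT
    result.filepath(1) AS year,
    result.filepath(2) AS month,
    *
FROM OPENROWSET(
    BULK 'blobcontainer/directory/*/*/*.parquet',
    DATA_SOURCE = 'blobname',
    FORMAT = 'parquet'
) AS [result]
}---")

# downstream users can then query the view like an ordinary table
somedata <- DBI::dbGetQuery(syncon, "SELECT TOP 100 * FROM dbo.parquet_vw")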

Dean MacGregor