
I'm quite new to web apps, so apologies if my question is a bit basic. I'm developing a web app with R shiny where the inputs are very large tables from Azure SQL Server. There are 20 tables, each on the order of hundreds of thousands of rows and hundreds of columns, containing numbers, characters, etc. I have no problem querying them; my main issue is that fetching everything from Azure SQL Server takes so much time, approximately 20 minutes, so the user of the web app has to wait quite a long time. I'm using the DBI package as follows:

db_connect <- function(database_config_name){
  dbConfig <- config::get(database_config_name)
  connection <- DBI::dbConnect(odbc::odbc(),
                        Driver = dbConfig$driver,
                        Server = dbConfig$server,
                        UID    = dbConfig$uid,
                        PWD    = dbConfig$pwd,
                        Database = dbConfig$database,
                        encoding = "latin1"
  )

  return(connection)
}

and then fetching tables by:

connection <- db_connect(db_config_name)
table <- dplyr::tbl(
  connection,
  dbplyr::in_schema(
    fetch_schema_name(db_config_name, table_name, data_source_type),
    fetch_table_name(db_config_name, table_name, data_source_type)
  )
)

I have searched a lot but didn't come across a good solution; I'd appreciate any solution that can tackle this problem.

Yasin Amini
  • [This link](https://shiny.rstudio.com/articles/overview.html) is probably the best place to start. There is functionality so that your database can do the filtering and your Shiny app can just bring back those results (rather than the whole database). It is the transporting of data that probably takes the most time, so this should help you. – p0bs Jan 06 '22 at 10:54
  • @p0bs That was not much help unfortunately. I have already tried different functions and drivers. My host is Azure SQL Server. Do you know whether it would be faster to fetch tables if I tried other SQL servers (like MySQL, PostgreSQL, etc.) or non-SQL databases? – Yasin Amini Jan 06 '22 at 13:29
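
To make the first comment's suggestion concrete, here is a minimal sketch of letting the database do the filtering (the schema, table, column names, and filter used here are hypothetical placeholders): with dbplyr, filter() and select() are translated to SQL and executed on the server, and only the reduced result is transferred to R when collect() runs.

library(dplyr)

connection <- db_connect(db_config_name)
# hypothetical schema/table, standing in for one of the 20 source tables
lazy_tbl <- dplyr::tbl(connection, dbplyr::in_schema("dbo", "big_table"))

small_result <- lazy_tbl %>%
  dplyr::filter(year == 2021) %>%   # translated to SQL, runs on the server
  dplyr::select(id, value) %>%      # only the columns the app needs
  dplyr::collect()                  # only this step transfers data to R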

1 Answer


I work with R accessing SQL Server (not Azure) daily. For larger data (as in your example), I always revert to using the command-line tool sqlcmd; it is significantly faster. The only pain point for me was learning the arguments and working around the fact that it does not return proper CSV; there is post-query munging required. You may have the additional pain point of having to adjust my example to connect to your Azure instance (I do not have an account).

In order to use this in a shiny environment and preserve its interactivity, I use the processx package to start the process in the background and then poll its exit status periodically to determine when it has completed.

Up front: this is mostly a "loose guide"; I do not pretend that this is a fully-functional solution for you. There might be some rough edges that you need to work through yourself. For instance, while I say you can do it asynchronously, it is up to you to work the polling process and delayed data availability into your shiny application. My answer here covers starting the process and reading the file once complete. And finally, if encoding= is an issue for you, I don't know whether sqlcmd handles non-latin encodings correctly, and I don't know if or how to fix this with its very limited, even antiquated arguments.

Steps:

  1. Save the query into a text file. Short queries can be provided on the command-line, but past some point (128 chars? I don't know that it's clearly defined, and have not looked enough recently) it just fails. Using a query-file is simple enough and always works, so I always use it.

    I always use temporary files for each query instead of hard-coding the filename; this just makes sense. For convenience (for me), I use the same tempfile base name and append .sql for the query and .csv for the returned data, that way it's much easier to match query-to-data in the temp files. It's a convention I use, nothing more.

    # `query` is assumed to be a character string containing your SQL
    tf <- tempfile()
    # using the same tempfile base name for both the query and csv-output temp files
    querytf <- paste0(tf, ".sql")
    writeLines(query, querytf)
    csvtf <- paste0(tf, ".csv")
    # these may be useful in troubleshooting, but not always [^2]
    stdouttf <- paste0(tf, ".stdout")
    stderrtf <- paste0(tf, ".stderr")
    
  2. Make the call. I suggest you see how fast this is in a synchronous way first to see if you need to add an async query and polling in your shiny interface.

    exe <- "/path/to/sqlcmd" # or "sqlcmd.exe"
    args <- c("-W", "b", "-s", "\037", "-i", querytf, "-o", csvtf,
              "-S", dbConfig$server, "-d", dbConfig$database,
              "-U", dbConfig$uid, "-P", dbConfig$pwd)
    ## as to why I use "\037", see [^1]
    ## note that the user id and password will be visible on the shiny server
    ## via a `ps -fax` command-line call
    proc <- processx::process$new(command = exe, args = args,
                                  stdout = stdouttf, stderr = stderrtf) # other args exist
    # this should return immediately, and should be TRUE until
    # data retrieval is done (or error)
    proc$is_alive()
    # this will hang (pause R) until retrieval is complete; if/when you
    # shift to asynchronous queries, do not do this
    proc$wait()
    

    One can use processx::run instead of process$new and proc$wait(), but I thought I'd start you down this path in case you want/need to go asynchronous.

  3. If you go with an asynchronous operation, then periodically check (perhaps every 3 or 10 seconds) whether proc$is_alive() still returns TRUE. Once it returns FALSE, you can start processing the file; a minimal shiny polling sketch is included after these steps. During this time, shiny will continue to operate normally. (If you do not go async and therefore choose to proc$wait(), then shiny will hang until the query is complete.)

    If you neither proc$wait() nor poll, and instead try to read the file immediately, that is a mistake. The file may not exist yet, in which case reading it will err with No such file or directory; it may exist but be empty; or it may exist with incomplete data. So make a firm decision: either stay synchronous and call proc$wait(), or go asynchronous and poll periodically until proc$is_alive() returns FALSE.

  4. Reading in the file. There are three "joys" of using sqlcmd that require special handling of the file.

    1. It does not do embedded quotes consistently, which is why I chose to use "\037" as a separator. (See [^1].)
    2. It adds a line of dashes under the column names, which will corrupt the auto-classing of data when R reads in the data. For this, we do a two-step read of the file.
    3. Nulls in the database are the literal NULL string in the data. For this, we update the na.strings= argument when reading the file.

    exitstat <- proc$get_exit_status()
    if (exitstat == 0) {
      ## read #1: get the column headers
      tmp1 <- read.csv(csvtf, nrows = 2, sep = "\037", header = FALSE)
      colnms <- unlist(tmp1[1,], use.names = FALSE)
      ## read #2: read the rest of the data
      out <- read.csv(csvtf, skip = 2, header = FALSE, sep = "\037",
                      na.strings = c("NA", "NULL"), quote = "")
      colnames(out) <- colnms
    } else {
      # you should check both stdout and stderr files, see [^2]
      stop("'sqlcmd' exit status: ", exitstat)
    }
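
A minimal sketch of the asynchronous variant from step 3 might look like the following inside a shiny server function. Here exe, args, stdouttf, stderrtf, and csvtf are assumed to be constructed as in the steps above, and input$run_query / output$preview are hypothetical UI bindings (an actionButton and a tableOutput); treat this as a starting point rather than a drop-in solution.

    library(shiny)

    server <- function(input, output, session) {
      # start the sqlcmd process in the background when the user asks for it
      proc <- eventReactive(input$run_query, {
        processx::process$new(command = exe, args = args,
                              stdout = stdouttf, stderr = stderrtf)
      })

      query_data <- reactiveVal(NULL)

      observe({
        req(proc())
        invalidateLater(3000, session)   # poll roughly every 3 seconds
        if (!proc()$is_alive() && is.null(query_data())) {
          if (proc()$get_exit_status() == 0) {
            # same two-step read as in step 4
            tmp1 <- read.csv(csvtf, nrows = 2, sep = "\037", header = FALSE)
            out  <- read.csv(csvtf, skip = 2, header = FALSE, sep = "\037",
                             na.strings = c("NA", "NULL"), quote = "")
            colnames(out) <- unlist(tmp1[1, ], use.names = FALSE)
            query_data(out)
          } else {
            showNotification("sqlcmd failed; check the stdout/stderr temp files")
          }
        }
      })

      output$preview <- renderTable(head(query_data()))
    }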
    

Note:

  1. After a lot of pain with several issues (some in sqlcmd.exe, some in data.table::fread and other readers, all dealing with CSV-format non-compliance), at one point I chose to stop working with comma-delimited returns, instead opting for the "\037" field delimiter. It works fine with all CSV-reading tools and has fixed so many problems (some not mentioned here). If you're not concerned, feel free to change the args to "-s", "," (adjusting the read as well).

  2. sqlcmd seems to use stdout or stderr in different ways when there are problems. I'm sure there's rationale somewhere, but the point is that if there is a problem, check both files.

    I added the use of both stdout= and stderr= because of a lot of troubleshooting I did, and continue to do if I munge a query. Using them is not strictly required, but you might be throwing caution to the wind if you omit those options.

  3. By the way, if you choose to use sqlcmd for all of your queries, there is no need to create a connection object in R; that is, db_connect may not be necessary. In my use, I tend to use "real" R DBI connections for known-small queries and bulk sqlcmd for anything above around 10K rows; a rough sketch of that split follows these notes. There is a tradeoff; I have not measured it sufficiently in my environment to know where the tipping point is, and it is likely different in your case.
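
A rough sketch of that split might look like this (run_sqlcmd_query() is a hypothetical wrapper around steps 1-4 above, not an existing function, and the 10K-row figure is just the rough threshold mentioned):

    fetch_table <- function(query, expected_rows, connection) {
      if (expected_rows < 10000) {
        # small result sets: a plain DBI round-trip is simple and fast enough
        DBI::dbGetQuery(connection, query)
      } else {
        # large result sets: shell out to sqlcmd as described in the steps above
        run_sqlcmd_query(query)  # hypothetical wrapper around steps 1-4
      }
    }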

r2evans
  • Wow, that's a nice answer! – Gregor Thomas Jan 06 '22 at 14:11
  • @GregorThomas It feels more like a tutorial ... – r2evans Jan 06 '22 at 14:12
  • The knowledge seems hard-won, and I'm sure it will help more than just OP. – Gregor Thomas Jan 06 '22 at 14:18
  • bcp is the command-line tool for extracting data from SQL Server. It doesn't have the issues sqlcmd has, as you've described, but may present some new ones. Of course, bear in mind that if you are building manual text extracts into processes, you are effectively plugging yourself and your laptop into the process. – wBob Jan 06 '22 at 14:27
  • @wBob (edited) yes, I've tried `bcp` as well, though as you said it has its own issues. For instance, I cannot get it to include column names, and I cannot find a way to properly quote the data. A [hack](https://stackoverflow.com/a/2061147/3358272) will cause ambiguity between an empty string and a null value. While I *can* work around the column names omission (though that's just stupid, imo), the null-vs-empty-string problem is bad for data-integrity (there's a difference between "I know it to be 0 chars" and "I do not know"). – r2evans Jan 06 '22 at 14:54
  • The queryout mode allows you to craft SQL statements, views or even stored procedures which address some of those issues. I’m pretty sure I used to add the headers with a batch file and the append operator (>>). Or you could look at SSIS for more complex (on-premises) extracts. – wBob Jan 06 '22 at 15:55
  • I've seen those recommendations. Changing a query to work around broken CSV and lack of headers is problematic, especially from a programmatic viewpoint: the only ways to *know* the column headers from a query are either (a) parse the SQL itself and hopefully get all column names correct, or (b) run the query with `TOP 0` to get the names, but with expensive queries this is bad. SSIS is not an option for programmatic use. – r2evans Jan 06 '22 at 16:01
  • @GregorThomas, thanks. "Hard-won" is an apt label ... while I have no expectation that I've misread something or missed something in the docs, there was a LOT of googling and searching SO for so many of these ... "bcp quoted fields" and "bcp column names" and such. Too many times I think I've had it and then go to "production" looking at different tables (and class types, etc), only to find another corner-case. In my private package, I have whole roxygen sections that count as "rants against breaking standards". Sigh. – r2evans Jan 06 '22 at 16:40
  • I am surprised you're using `\037`, `fortunes::fortune("james bond")` led me to believe `\031` was the best choice (other than perhaps `\007`). – Gregor Thomas Jan 06 '22 at 16:57
  • I should have said *"while I have no expectation that I **have not** misread"*. Oops. – r2evans Jan 06 '22 at 17:16
  • @r2evans Hi, thank you very much for the answer. It saved me lots of time. It seems you've been working with shiny for a long time. Have you ever come across an answer to "Adapt/update filter choices to already applied filters in shiny DT datatable"? When you filter a column (suppose everything is a factor), the rest of the columns still show all options to choose from, not the filtered options. Here is the link to the question: https://stackoverflow.com/questions/57867482/adapt-update-filter-choices-to-already-applied-filters-in-shiny-dt-datatable – Yasin Amini Jan 27 '22 at 13:30
  • I've used `DT` but am not proficient enough at it for a quick answer, and unfortunately I don't have the time atm to dive into something deeper. Sorry, I hope somebody else can help you sooner. – r2evans Jan 27 '22 at 13:48