0

I do have a remote connection to a SQL table using dbplyr. One of the table is composed of one ID column and several other columns storing 0 and 1 values (SQL bit - interpreted as boolean TRUE/FALSE values from R side) and from R I simply want to get the total number of 1 for each row.

It is straightforward in R with an usual table using for instance rowSums() which unfortunately does not work through dbplyr (no SQL equivalent).

For obvious reasons due to the size of the underlying table I do not want to collect() the data.

How one could achieve that in such a context?

library(dplyr)
# Local case
DF <- tibble(ID = LETTERS[1:3], col1 = c(1,1,1), col2 = c(1,1,0), col3 = c(1,0,0))
DF %>% 
  summarise(sum = rowSums(select(., -1)))
#   sum
# 1   3
# 2   2
# 3   1

# If DF is a remote SQL table, therefore one would get the following error message:  
# Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]'rowSums' is not a recognized built-in function name.  [Microsoft][ODBC SQL Server Driver][SQL Server]

EDIT - ADDING MINIMAL REPRODUCIBLE EXAMPLE

Following @Simon.S.A. reply, below a MRE:

# Table creation
DF <- tibble(ID = LETTERS[1:3], col1 = c(1,1,1), col2 = c(1, 1,0), col3 = c(1,0,0))
colnames(DF) <- c("col 1", "col 2", "col 3", "col 4")
# SQL simulation
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, DF)
con %>% tbl("DF") # just checking
#preparing formula
cols <- colnames(DF)[-1]
all_equations <- paste0("`", cols, "` =  sum(`", cols,"`)")
# actual query
con %>% 
  tbl("DF") %>% 
  summarise(!!!rlang::parse_exprs(all_equations))
# Error: near "=": syntax error
# %>% show_query() shows a strange query, but I am no SQL expert as you understood.
# also tried: 
# all_equations <- paste(cols ,"=  sum(",cols,")")
# all_equations <- paste0("`[", cols, "]` =  sum(`[", cols,"]`)")
yeahman269
  • 705
  • 7
  • 16

3 Answers3

1

Part of the challenge here is that dbplyr translates dplyr commands into SQL, but translations are only defined for some R commands. As translations exists for the standard dplyr commands, we can use summarise.

To summarise all in one go, we can do the following:

library(dplyr)
library(rlang)

cols = colnames(DF)
cols = cols[2:length(cols)]

all_equations = paste(cols ,"=  sum(",cols,")")


DF %>%
  summarise(!!!parse_exprs(all_equations))

The idea is to build text strings of every sum, and then use !!!parse_exprs(.) to turn this text into R code.

EDIT - Same approach but for row sums

# Table creation
DF <- tibble(ID = LETTERS[1:3], col1 = c(1,1,1), col2 = c(1, 1,0), col3 = c(1,0,0))
colnames(DF) <- c("col 1", "col 2", "col 3", "col 4")
# SQL simulation
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, DF)
con %>% tbl("DF") # just checking
#preparing formula
cols <- colnames(DF)[-1]
eq <- paste0("`",paste0(cols, collapse = "` + `"),"`")
# actual query
con %>% 
  tbl("DF") %>% 
  mutate(new = !!parse_expr(eq))

But still depends on dbplyr translation and so may not handle back-ticks correctly.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
  • Hi Simon.S.A.. Thanks for the insights, it is helping and I think we are close to the final answer. I can't get it working. I updated the initial question with a MRE – yeahman269 Jun 18 '21 at 14:26
  • Sorry, I misread your question and answered for column sums not row sums. Updating with code for row sums. – Simon.S.A. Jun 18 '21 at 21:28
  • Thanks @Simon.S.A. it works as expected. As the SQL values are boolean values a `as.numeric()` will be required in the `eq` expression: `paste0(paste0("as.numeric(`", col_targets, collapse = "`) + "),"`)")`. – yeahman269 Jun 21 '21 at 12:12
  • is there by any chance a way of evaluating the expression `eq` above without calling for rlang (or any other package which is not included in `tidyverse`)? – yeahman269 Jun 21 '21 at 13:07
  • Not sure I understand this request. rlang is a required package for the tidyverse. You can see on the dplyr CRAN page that dplyr imports rlang: https://cran.r-project.org/web/packages/dplyr/index.html – Simon.S.A. Jun 21 '21 at 21:08
1

I found that a possible workaround is to write down the actual query, using DBI package for instance. But I remain interested by a more elegant way using dbplyr.

DF <- tibble(ID = LETTERS[1:3], col1 = c(1,1,1), col2 = c(1, 1,0), col3 = c(1,0,0))
colnames(DF) <- c("col 1", "col 2", "col 3", "col 4") # having spaces in column names increase handling complexity
# SQL simulation
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, DF)
con %>% tbl("DF") # just checking
cols <- colnames(DF)[-1]
col2select <- colnames(DF) # column to select in the result
query <- paste0("SELECT ", 
                paste0("[", col2select, "]", collapse =", "), 
                ", ", 
                # paste0("CAST([", cols,"] AS INT)", collapse = " + "), 
                paste0("[", cols,"]", collapse = " + "), 
                " AS sum FROM DF")
rs <- DBI::dbSendQuery(con, query)
DBI::dbFetch(rs)
DBI::dbClearResult(rs)
DBI::dbDisconnect(con)
yeahman269
  • 705
  • 7
  • 16
  • Yes, dbplyr does not handle translating backticks well. Thankfully these are only necessary if your column names contains spaces or special characters. – Simon.S.A. Jun 18 '21 at 21:25
1

Using tidyr makes for pretty readable dplyr code. How this performs with a large table is less clear.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
DF <- tibble(ID = LETTERS[1:3], col1 = c(1,1,1), 
             col2 = c(1, 1,0), col3 = c(1,0,0))

# SQL simulation
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DF <- copy_to(con, DF, overwrite = TRUE)

result <- 
  DF %>% 
  pivot_longer(cols = -ID) %>% 
  group_by(ID) %>% 
  summarize(sum = sum(value, na.rm = TRUE))

result
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.35.5 [:memory:]
#>   ID      sum
#>   <chr> <dbl>
#> 1 A         3
#> 2 B         2
#> 3 C         1

result %>% show_query()
#> <SQL>
#> SELECT `ID`, SUM(`value`) AS `sum`
#> FROM (SELECT `ID`, 'col1' AS `name`, `col1` AS `value`
#> FROM `DF`
#> UNION ALL
#> SELECT `ID`, 'col2' AS `name`, `col2` AS `value`
#> FROM `DF`
#> UNION ALL
#> SELECT `ID`, 'col3' AS `name`, `col3` AS `value`
#> FROM `DF`)
#> GROUP BY `ID`

Created on 2021-06-18 by the reprex package (v2.0.0)

library(dplyr, warn.conflicts = FALSE)
library(DBI)

n <- 26e3

df <- tibble(ID = rep(LETTERS, n/26))

for (i in 1:100) df[[paste0("col", i)]] <- rbinom(prob = 0.5, n = n, size = 1)

# SQL simulation
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

df_sql <- copy_to(con, df, overwrite = TRUE)
row_sum_1 <- function(df, con) {
    sum_cols <- setdiff(colnames(df), "ID")
    names <- paste(DBI::dbQuoteIdentifier(con, colnames(df)), collapse = ", ")
    sum_sql <- paste(DBI::dbQuoteIdentifier(con, sum_cols), collapse = " + ")
    
    query <- paste0("SELECT ", names, ", ", 
                    sum_sql," AS sum FROM df")
    tbl(con, sql(query))
}

row_sum_1(df_sql, con) %>% select(ID, sum)
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.35.5 [:memory:]
#>    ID      sum
#>    <chr> <int>
#>  1 A        49
#>  2 B        53
#>  3 C        54
#>  4 D        49
#>  5 E        51
#>  6 F        46
#>  7 G        55
#>  8 H        48
#>  9 I        44
#> 10 J        50
#> # … with more rows
system.time(compute(row_sum_1(df_sql, con)))
#>    user  system elapsed 
#>   0.307   0.007   0.315

Created on 2021-06-21 by the reprex package (v2.0.0)

Ian Gow
  • 3,098
  • 1
  • 25
  • 31
  • Thanks for the proposal. I thought of this approach and forgot to mention in the initial post that I think it is prefereable to avoid changing data structure since the actual SQL table is quite large. I up-voted your answer but selected the direct approach of @Simon.S.A. as anwer. – yeahman269 Jun 21 '21 at 12:15
  • Yes, I believe the query created with many `UNION ALL` components would not work well with large data sets. I don't have SQL Server to test with a larger data frame, but I think one can get around the need to worry about backticks and the like using `dbQuoteIdentifier` as I do in a version of the accepted answer I added at the bottom of my own. – Ian Gow Jun 21 '21 at 16:58