2

This is partly related to my previous question. If I filter a dataframe using dplyr based on unique ids with trailing white space from ids with no trailing white space, dplyr will consider white space to be a character and a match will not occur, resulting in an empty dataframe:

library(tidyverse)
df <- tibble(a = c("hjhjh"), d = c(1))
df
# # A tibble: 2 x 2
#   a          d
#   <chr>  <dbl>
# 1 hjhjh      1

ids <- df %>% 
  select(a) %>% 
  pull()
ids
#[1] "hjhjh"

df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
df_with_space
#quotation marks:
# # A tibble: 2 x 2
#   a            d
#   <chr>    <dbl>
# 1 "hjhjh "     1
# 2 "popopo"     2

#now filter
df_new <- df_with_space %>% 
  filter(a  %in% ids)
df_new
# no direct match made, empty dataframe
# A tibble: 0 x 2
# ... with 2 variables: a <chr>, d <dbl>

If I try to do the same thing and filter using dbplyr from a SQL database, it ignores the white space in the filtering but still includes it in the final output, example code:

library(dbplyr)
library(DBI)
library(odbc)
test_db <- dbConnect(odbc::odbc(),
                       Database = "test",
                       dsn = "SQL_server") 
db_df <- tbl(test_db, "testing")
db_df <- db_df %>% 
  filter(a  %in% ids) %>% 
  collect()
#quotation marks:
# # A tibble: 1 x 2
#   a            d
#   <chr>    <dbl>
# 1 "hjhjh "     1   #matches but includes the white space

I'm not familiar with SQL - is this expected? If so, when do you need to worry about (trailing) white space? I thought I would need to trim whitespace first which is very slow on a large database:

db_df <- db_df %>% 
  mutate(a = str_trim(a, "both")) %>% 
  filter(a  %in% ids) %>% 
  collect() 

thanks


EDIT

With show_query

<SQL>
SELECT *
FROM `df`
WHERE (`a` IN ('hjhjh'))

I think this produces a reproducible scenario:

dfx <- data.frame(a = c("hjhjh ", "popopo"), d = c(1, 2))
dfx = tbl_lazy(dfx, con = simulate_mssql())
dfx %>% 
  filter(a  %in% ids) 
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`a` IN ('hjhjh'))
user63230
  • 4,095
  • 21
  • 43

1 Answers1

2

If you're connecting to SQL Server, then I can reproduce this. I'll label it as a "bug", personally, and will never rely on it ...

No need to use dbplyr here, the issue is in the underlying DBMS; dbplyr is just the messenger, don't blame the messenger :-)

Setup

consqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
conpg <- DBI::dbConnect(odbc::odbc(), ...)
conmar <- DBI::dbConnect(odbc::odbc(), ...)
conss <- DBI::dbConnect(odbc::odbc(), ...)
cons <- list(sqlite = consqlite, postgres = conpg, maria = conmar, sqlserver = conss)

df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
for (thiscon in cons) {
  DBI::dbWriteTable(thiscon, "mytable", df_with_space)
}

Tests

lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('hjhjh')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
#        a d
# 1 hjhjh  1
# $sqlserver
#        a d
# 1 hjhjh  1

lapply(cons, function(thiscon) {
  DBI::dbGetQuery(thiscon, "select * from mytable where a in ('popopo ')")
})
# $sqlite
# [1] a d
# <0 rows> (or 0-length row.names)
# $postgres
# [1] a d
# <0 rows> (or 0-length row.names)
# $maria
#        a d
# 1 popopo 2
# $sqlserver
#        a d
# 1 popopo 2

SQL Server and MariaDB "fail" in both test cases, neither SQLite nor Postgres fall for it.

I don't see this in the SQL spec, so I don't know if these are bugs, unintended/undocumented features, options, or something else.

Workaround

Sorry, I don't have one off-hand. (Not without accepting this "feature" and doing additional filtering post-query.)

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Interesting thanks. So to be fully safe, you think a step like `mutate(a = str_trim(a, "both")) ` may be necessary? Run time is `2714`s vs `4`s! – user63230 Aug 31 '21 at 14:13
  • I'm not certain, to be honest. I don't use `dbplyr` regularly, so I don't know how that is rendered into SQL nor how awesome it is. I was hoping to initially find a SQL-based mitigation, but `IN (...)` doesn't work with operators, so I fear that it may require a post-`dbplyr` step to double-check (and filter, as needed) these things. Not a great thing, in my mind, but perhaps the only way since you appear to be using one of the "buggy" (?) DBMSes. – r2evans Aug 31 '21 at 14:19
  • 1
    I think this discussion is relevant https://stackoverflow.com/questions/60725072/undocumented-feature-when-select-in-varchar-with-trailing-whitespace-sql-server – user63230 Aug 31 '21 at 15:15