This is partly related to my previous question. If I filter
a dataframe using dplyr
based on unique ids with trailing white space from ids with no trailing white space, dplyr
will consider white space to be a character and a match will not occur, resulting in an empty dataframe:
library(tidyverse)
df <- tibble(a = c("hjhjh"), d = c(1))
df
# # A tibble: 2 x 2
# a d
# <chr> <dbl>
# 1 hjhjh 1
ids <- df %>%
select(a) %>%
pull()
ids
#[1] "hjhjh"
df_with_space <- tibble(a = c("hjhjh ", "popopo"), d = c(1, 2))
df_with_space
#quotation marks:
# # A tibble: 2 x 2
# a d
# <chr> <dbl>
# 1 "hjhjh " 1
# 2 "popopo" 2
#now filter
df_new <- df_with_space %>%
filter(a %in% ids)
df_new
# no direct match made, empty dataframe
# A tibble: 0 x 2
# ... with 2 variables: a <chr>, d <dbl>
If I try to do the same thing and filter using dbplyr
from a SQL
database, it ignores the white space in the filtering but still includes it in the final output, example code:
library(dbplyr)
library(DBI)
library(odbc)
test_db <- dbConnect(odbc::odbc(),
Database = "test",
dsn = "SQL_server")
db_df <- tbl(test_db, "testing")
db_df <- db_df %>%
filter(a %in% ids) %>%
collect()
#quotation marks:
# # A tibble: 1 x 2
# a d
# <chr> <dbl>
# 1 "hjhjh " 1 #matches but includes the white space
I'm not familiar with SQL
- is this expected? If so, when do you need to worry about (trailing) white space? I thought I would need to trim
whitespace first which is very slow on a large database:
db_df <- db_df %>%
mutate(a = str_trim(a, "both")) %>%
filter(a %in% ids) %>%
collect()
thanks
EDIT
With show_query
<SQL>
SELECT *
FROM `df`
WHERE (`a` IN ('hjhjh'))
I think this produces a reproducible scenario:
dfx <- data.frame(a = c("hjhjh ", "popopo"), d = c(1, 2))
dfx = tbl_lazy(dfx, con = simulate_mssql())
dfx %>%
filter(a %in% ids)
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`a` IN ('hjhjh'))