1

dbplyr has some very useful-looking simulation functions so you can write queries while not connected to any real database, but I can't seem to get actual table names into any of the queries I write that way. All their names are just `df`, and I can't seem to modify them afterward. In fact, I don't see `df` anywhere in the query object or its attributes (it doesn't have any), so now I have no idea how dbplyr processes table names at all.

MWE:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(magrittr)
library(purrr, warn.conflicts = FALSE)

query <- tbl_lazy(df = mtcars)
query %$% names(ops)
#> [1] "x"    "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`

# The actual data frame is stored in the object under the name `x`, but
# renaming it has no effect, unsurprisingly (since it wasn't named `df`
# anyway)
query %<>% modify_at("ops", set_names, "mtcars", "vars")
query %$% names(ops)
#> [1] "mtcars" "vars"
show_query(query)
#> <SQL>
#> SELECT *
#> FROM `df`

My use case, by the way, is that I need to run SQL queries in another system with actual server access, so I'd like to have R scripts that produce SQL syntax that's ready to run in that system, even though R can't connect to it. Making an empty dummy database with the structure of the real thing (table & column names, column types, but no rows) is an option, but, obviously, it'd be simpler to just use these free-form simulations, iff the SQL can be generated ready to cut and paste. (lazy_frame() looked more appropriate for such non-existent tables, but, guess what, it's really just a wrapper for tbl_lazy(tibble()), so, same exact `df` name problem.)

Created on 2019-12-12 by the reprex package (v0.3.0)

DHW
  • 1,157
  • 1
  • 9
  • 24
  • 1
    I still find this is an under-appreciated use case (_c.f._ https://stackoverflow.com/q/49078185/6636442 ), but have no good solution. I have tried the in-memory sqlite trick, which allows you to name initial tables, but the rendered sql requires a lot of hand editing to impose table names on sub queries. – steveo'america Feb 05 '20 at 20:38

1 Answers1

0

I am not aware of any way to rename the simulated tables. According to to documentation, the important point of the simulate_* functions is to test database translation without actually connecting to a database.

When connected to a remote table, dbplyr uses the database, schema, and table name defined using tbl(). It also fetches the column names. Because of this, I would recommend developing in an environment where R can connect to the database. Consider the following:

# simulated
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
df_sim %>% head(5) %>% show_query()

# output
<SQL>
SELECT TOP(5) *
FROM `df`

# actual
df = tbl('db_table_name', con = database_connection_object)
df %>% head(5) %>% show_query()

# output
<SQL>
SELECT TOP(5) col1, col2, col3
FROM "database"."db_table_name"

Not only does df get replaced by the table name, but the * in the simulated query is replaced by column names in the second query.

One option you might consider if it is important to generate SQL scripts via simulation is converting to text, replacing, and converting back. For example:

df_sim = tbl_lazy(mtcars, con = simulate_mssql())
query = df_sim %>% head(5) %>% as.character()
query = gsub("`df`", "[db].[schema].[table]", query)

# write query out to file
writeLines(query, "file.sql")
# OR create a remote connection
remote_table = tbl(db_connection, sql(query))

remote_table %>% show_query()
# output
<SQL>
SELECT TOP(5) *
FROM [db].[schema].[table]
Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
  • Once you join two tables you'll find they are both called `df` and this hack will not work. – steveo'america Feb 05 '20 at 20:40
  • Good point @steveo'america. I would recommend trying the substitution before the join, immediately after the `tbl_lazy` command for each table. – Simon.S.A. Feb 06 '20 at 00:28