
I am working with a large SQLite database, using RSQLite and RStudio to run many different queries. The query results are often quite large, and I don't want to waste resources by reading them into R and then saving them to CSV; I would prefer to have the database itself handle the export. However, the following does not work, and I'm not sure what else to try.

library(RSQLite)
db <- dbConnect(RSQLite::SQLite(), "test.sqlite")

dbWriteTable(db, "iris", iris)

dbListTables(db)


sql <- ".headers on
.mode csv
.output C:/Users/jmc6538/Documents/testing/setosa.csv
SELECT * FROM iris WHERE Species = 'setosa';"

result <- dbSendQuery(db, sql)

which produces this error:

> result <- dbSendQuery(db, sql)
Error: near ".": syntax error

I've tried putting in various combinations of escape characters and that doesn't fix the issue.
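For what it's worth, the plain SELECT runs fine through RSQLite once the dot-commands are removed, so the problem seems to be those lines specifically. A minimal sketch against an in-memory copy of the data:

```r
library(RSQLite)

# Reproduce the setup with an in-memory database
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(db, "iris", iris)

# The bare SELECT works; only the dot-command lines trigger the error
setosa <- dbGetQuery(db, "SELECT * FROM iris WHERE Species = 'setosa'")
nrow(setosa)  # 50
dbDisconnect(db)
```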

jamesguy0121
2 Answers


This syntax is not accepted because those lines are not SQL statements at all: `.headers`, `.mode`, and `.output` are dot-commands understood only by the sqlite3 command-line shell, not by the SQLite library that RSQLite talks to. To use them:

  1. Install the sqlite command line utility from here

  2. Follow these instructions, perhaps modifying for use in R, such as

    shell('sqlite3 -header -csv c:/sqlite/chinook.db "select * from tracks;" > tracks.csv')

Make sure that sqlite3 is on your PATH by opening a command prompt and typing sqlite3. If you get an error such as sqlite3 is not recognized, add the installation folder containing sqlite3.exe to your PATH. Alternatively, set the working directory in R to the installation folder:

setwd("C:/[installation folder containing sqlite3.exe]")
shell('sqlite3 -header -csv c:/sqlite/chinook.db "select * from tracks;" > tracks.csv')
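To make the call reusable, the command string can be built with shQuote() so the paths and the query are quoted safely. A sketch (build_export_cmd is an illustrative helper name, not from any package):

```r
# Sketch: assemble the sqlite3 CLI call; assumes sqlite3 is on the PATH.
# build_export_cmd is an illustrative name, not from any package.
build_export_cmd <- function(db_path, query, out_path) {
  sprintf("sqlite3 -header -csv %s %s > %s",
          shQuote(db_path), shQuote(query), shQuote(out_path))
}

cmd <- build_export_cmd("test.sqlite",
                        "SELECT * FROM iris WHERE Species = 'setosa';",
                        "setosa.csv")
# shell(cmd) on Windows; system(cmd) elsewhere handles the redirection
```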
gaut
  • Oooh, This is a good suggestion, I didn't think about just sending the query through the command line. I already had the command line utility, and it almost works. The redirect command isn't working. This is what I have, `system('sqlite3 -header -csv C:/Users/jmc6538/Documents/testing/test.sqlite "select * from iris where Species = \'setosa\';" > test.csv')` and the output in R prints the data and says `Error: near ">": syntax error` Any suggestions. Tried specifying the full path and that didn't do anything. – jamesguy0121 Nov 02 '20 at 20:40
  • It is because I am on Windows...of course. According to `system()` docs, `This means that it cannot be assumed that redirection or piping will work in system (redirection sometimes does, but we have seen cases where it stopped working after a Windows security patch), and system2 (or shell) must be used on Windows.` `system2()` didn't work for me, but `shell()` did. – jamesguy0121 Nov 02 '20 at 20:55
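The redirection problem discussed in the comments can also be sidestepped by letting sqlite3 write the file itself: a reasonably recent sqlite3 accepts dot-commands as extra command-line arguments, and `.once` sends the next query's output to a file, so no shell `>` is needed. A sketch:

```r
# Sketch: no shell redirection; sqlite3's .once dot-command writes the file.
# Assumes a reasonably recent sqlite3 binary on the PATH.
cmd <- paste('sqlite3 -header -csv test.sqlite',
             '".once setosa.csv"',
             '"SELECT * FROM iris WHERE Species = \'setosa\';"')
# shell(cmd)  # on Windows; system(cmd) elsewhere
```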

So I ended up doing some benchmarking for this, comparing the accepted answer with reading the table into R and then saving to CSV, since my goal was to cut down on compute time. I thought others might find this helpful, and the results are fairly interesting. I tested it on my Windows laptop as well as a super computer that I have access to, with three different table sizes in each environment.

For my laptop, these were the commands.

library(microbenchmark)
library(RSQLite)

sql <- "SELECT * FROM iris;"  # plain SQL this time; no dot-commands

microbenchmark(
  sys = {shell('sqlite3 -header -csv C:/Users/jmc6538/Documents/testing/test.sqlite "select * from iris;" > C:/Users/jmc6538/Documents/testing/test.csv')},
  RSQLite = {
    db <- dbConnect(RSQLite::SQLite(), "test.sqlite")
    result <- dbSendQuery(db, sql)
    write.csv(dbFetch(result), "./test2.csv")
    dbClearResult(result)
    dbDisconnect(db)
  })

The different table sizes were a single copy of the iris data (150 rows), the iris data stacked 4,502 times (675,300 rows), and the iris data stacked 45,020 times (6,753,000 rows). Here are the results, in the same order.

Unit: milliseconds
    expr        min         lq      mean     median        uq       max neval cld
     sys 213.758862 216.612667 223.37484 217.858642 218.88561 342.42205   100   b
 RSQLite   6.981669   7.663255  10.08729   9.377483  10.90106  25.52816   100  a 

Unit: seconds
    expr       min        lq      mean    median        uq       max neval cld
     sys 10.633965 11.788077 12.844148 12.967207 13.757598 16.142357   100   b
 RSQLite  5.164632  6.791199  7.172043  7.351825  7.728878  8.486433   100  a 

Unit: seconds
    expr      min        lq      mean    median        uq       max neval cld
     sys 94.38379 101.36282 103.49785 102.98158 104.52018 114.95047   100   b
 RSQLite 48.48948  50.28782  53.18634  53.64562  55.80938  63.87987   100  a 

So on my laptop, using RSQLite and write.csv was always faster by a significant amount.

On the super computer the row counts were 150, 900,000, and 9,000,000, and the commands were

library(microbenchmark)
library(RSQLite)

sql <- "SELECT * FROM iris;"  # plain SQL this time; no dot-commands

microbenchmark(
  sys = {system('sqlite3 -header -csv C:/Users/jmc6538/Documents/testing/test.sqlite "select * from iris;" > C:/Users/jmc6538/Documents/testing/test.csv')},
  RSQLite = {
    db <- dbConnect(RSQLite::SQLite(), "test.sqlite")
    result <- dbSendQuery(db, sql)
    write.csv(dbFetch(result), "./test2.csv")
    dbClearResult(result)
    dbDisconnect(db)
  })

And the results.

[Image: Super Computer Benchmark Results]

On the super computer using the system call was faster for the larger tables, but not for the small table.

Edit: Using data.table's `fwrite` to write the CSVs made the RSQLite method faster than the shell method on the super computer as well. I only did 10 repetitions for the 900,000- and 9,000,000-row tables because I didn't want to wait for 100. I think this is the better method if people are looking for speed.
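The `fwrite` variant only swaps the write step. A sketch of the benchmarked expression, made self-contained here with an in-memory database (assumes data.table is installed):

```r
library(RSQLite)
library(data.table)

# In-memory stand-in for test.sqlite so the sketch runs anywhere
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(db, "iris", iris)

result <- dbSendQuery(db, "SELECT * FROM iris;")
fwrite(dbFetch(result), "test2.csv")  # drop-in replacement for write.csv
dbClearResult(result)
dbDisconnect(db)
```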

[Image: benchmark results using data.table's fwrite]

jamesguy0121