
I need to analyse a large dataset (~40 GB, 3 million rows); it is too big to open in a spreadsheet or in R. To address that, I loaded it into an SQLite database, then used R (with RSQLite) to split it into parts I can manipulate (70,000 rows each). I need the data in data.frame format, so I used as.data.frame():

# Connecting to the database
library(DBI)
library(dplyr)   # tbl(), filter(), %>%; the SQL backend also needs dbplyr installed
con <- dbConnect(drv = RSQLite::SQLite(), dbname = "path")

# Connecting to the table (a lazy reference; no rows are pulled yet)
d <- tbl(con, "Test")

# Filter the database and convert it
d %>% 
   # I filtered using dplyr
   filter("reduce number of rows") %>% 
   as.data.frame()

It works but takes a lot of time. How can I make this faster (I have limited RAM)? I also tried setDT(), but it doesn't work on SQLite data:

d %>% setDT()

Error in setDT(.) : 
All elements in argument 'x' to 'setDT' must be of same length, but the profile of input lengths (length:frequency) is: [2:1, 13:1]
The first entry with fewer than 13 entries is 1
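setDT() fails here because d is not yet an in-memory table: it is a lazy dbplyr query (essentially a small list describing the connection and the SQL), so its elements have unequal lengths. A minimal sketch, assuming a hypothetical filter condition and the con/d objects from above, of collecting the result first and then converting it with data.table:

library(dplyr)
library(data.table)

dt <- d %>%
  filter(some_column > 0) %>%   # hypothetical filter; use your real condition
  collect() %>%                 # run the SQL and pull the matching rows into R
  setDT()                       # convert the in-memory result to a data.table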
Dimitri
  • Please indicate how you queried it and how much data you're pulling in. (I'm assuming it isn't all 3 million rows since you say you "split it in smaller parts".) – r2evans Feb 03 '23 at 13:28
  • 4
    I think there is very little you can do with this path to improve the speed: it appears that you are bringing in a lot of data based on your `filter`, so the only way to reduce the query time is to improve the filter so that fewer rows are returned. Lacking that, options include: switching to another DBMS (e.g., `duckdb`, which claims to be fairly fast, or perhaps a server-based DBMS such as postgres/mariadb/sql-server) or perhaps to a parquet file via the `arrow` package. The latter requires that something write all the data at some point, so may not be easy atm. – r2evans Feb 03 '23 at 14:01
  • Sorry, edited the number of rows. Does making even smaller databases improve overall performance (I still need to process all 3 million rows)? The query part is quite quick; it is the as.data.frame that takes a long time. – Dimitri Feb 03 '23 at 14:06
  • Do you really need to load all 3mi rows at once, or can you do it piece-meal? If "all", then is most of what you need to do "simple" analysis? If you're able to do it in raw SQL or in "lazy" dplyr (requiring a subsequent call to `collect`), you may be able to utilize the underlying engine (sql or arrow/parquet) to do most of the heavy lifting before bringing it in to R. Another thought: do you need all of the columns? Perhaps you can work on a subset of columns at a time? – r2evans Feb 03 '23 at 14:13
  • I have to apply a statistical model to my data, so I need it in data.frame format. I wanted to proceed in pieces: extract a sub-database from my SQL database, apply the models to it, save the result in the computer's memory, delete the intermediary databases and start over. I don't know if I answered your question? – Dimitri Feb 03 '23 at 14:17
  • 1
    I think you did, and I think you have what you have (given what we know). If you can pull in fewer columns, do that. You say you can subset the data, that's good; I suggest after `rm`ing the object, call `gc()` to make sure it is garbage-collected (may not be truly necessary but certainly cannot hurt). The performance of pulling the data from SQLite is what it is, typically fast-enough but apparently not great for you. Again, reducing the number of columns/rows in the pull will reduce the time, I know of no "binary-fast-vroom turbo-charge" mode :-) – r2evans Feb 03 '23 at 14:21
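A rough sketch of the arrow/Parquet route mentioned in the comments above, assuming the data has already been written out to Parquet files (the directory, column names, and filter are hypothetical):

library(arrow)
library(dplyr)

ds <- open_dataset("path/to/parquet_dir")   # hypothetical directory of Parquet files

ds %>%
  filter(some_column > 0) %>%             # hypothetical filter, pushed down to the arrow engine
  select(some_column, other_column) %>%   # read only the columns you need
  collect()                               # only the filtered subset is loaded into RAM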

1 Answer


To process successive chunks of 70,000 rows using con from the question, replace the print statement below with whatever processing you need (base, dplyr, data.table, etc.).

rs <- dbSendQuery(con, "select * from Test")   # send the query; no rows are fetched yet
while(!dbHasCompleted(rs)) {
  dat <- dbFetch(rs, 70000)   # fetch the next chunk of up to 70,000 rows as a data.frame
  print(dim(dat))             # replace with your processing
}
dbClearResult(rs)             # release the result set
dbDisconnect(con)
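
As the comments suggest, restricting the query to the columns and rows actually needed makes each chunk smaller and faster to fetch. A variant sketch with a hypothetical column list and WHERE clause:

rs <- dbSendQuery(con, "select col1, col2 from Test where col1 > 0")
while(!dbHasCompleted(rs)) {
  dat <- dbFetch(rs, 70000)
  # fit the statistical model on this chunk and keep the result, e.g.
  # results[[length(results) + 1]] <- your_model_function(dat)
}
dbClearResult(rs)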
G. Grothendieck