
I have a 9GB .csv file and would like to convert it to an SQLite database.

I have followed https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html#Connecting_to_databases and it works on my local machine, but on a server it says the disk/database is full. In any case, on the local machine the DB file appears to be 0KB and contains no tables. Any thoughts why?

This is what I do:

library(dplyr)
library(dbplyr)

#Test data

df <- data.frame(x1 = runif(1e7), x2 = runif(1e7))

#Make DB
PassengerData <- src_sqlite("FINAL_data.sqlite", create = TRUE)

#Copy dataframe to DB
copy_to(PassengerData, df)

# create a lazy reference to the "df" table in the DB
PassengerData <- tbl(PassengerData, "df")

Then I close R and open a new session. To look at the DB I do:

df <- DBI::dbConnect(RSQLite::SQLite(), "FINAL_data.sqlite")

src_dbi(df)
src:  sqlite 3.34.1 [Data/FINAL_data.sqlite]
tbls:

There are no tables in it. Why?

HCAI
  • Check the documentation for `copy_to`: it creates a temporary table by default. You should do it this way: `copy_to(PassengerData, df, temporary = FALSE)` (see the sketch after these comments). – José Feb 04 '21 at 14:58
  • Another thing to try is: https://stackoverflow.com/questions/65776472/setting-up-an-sqllite-database-in-r-from-a-csv-file-that-cannot-read-into-memory/65776932#65776932 – G. Grothendieck Feb 04 '21 at 15:11
  • @Grothendieck thank you for the link. I tried this first but it never completes for some reason. I'm trying to use the data in a reactive Shiny app that filters it down to 10k rows and plots it. Do you think SQL is not the way forward? – HCAI Feb 04 '21 at 20:27
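A minimal sketch of the fix from the first comment, using a plain DBI connection rather than the deprecated src_sqlite() (the file and table names are taken from the question):

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), "FINAL_data.sqlite")
df <- data.frame(x1 = runif(1e7), x2 = runif(1e7))

# temporary = FALSE writes a real, persistent table; the default
# (TRUE) creates a session-scoped table that vanishes on disconnect,
# which is why the file ends up with no tables in a new session
copy_to(con, df, name = "df", temporary = FALSE)

DBI::dbListTables(con)  # should now list "df"
DBI::dbDisconnect(con)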

1 Answer


I suggest that loading an entire 9GB file into R for the sole purpose of ingesting it into a SQLite3 database is a little flawed: you may not have sufficient memory to load it into R in the first place. Instead, use sqlite3 by itself.

I have pre-made a file mt.csv from mtcars.

$ sqlite3 -csv mt.sqlite3 '.import mt.csv mtcars'
$ ls -l mt.sqlite3
-rw-r--r-- 1 r2 r2 8192 Feb  4 12:19 mt.sqlite3

$ sqlite3 -column -header mt.sqlite3 'select * from mtcars limit 3'
mpg         cyl         disp        hp          drat        wt          qsec        vs          am          gear        carb
----------  ----------  ----------  ----------  ----------  ----------  ----------  ----------  ----------  ----------  ----------
21          6           160         110         3.9         2.62        16.46       0           1           4           4
21          6           160         110         3.9         2.875       17.02       0           1           4           4
22.8        4           108         93          3.85        2.32        18.61       1           1           4           1

The sqlite3 binary is not installed by default on many (any?) systems, but it is an easy and free download/installation (see https://www.sqlite.org/download.html).
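For completeness, a minimal sketch of reading such a database back from R with the dplyr workflow from the question (assuming the mt.sqlite3 file created above):

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), "mt.sqlite3")

# the filter is translated to SQL and runs inside SQLite, so collect()
# only pulls the matching rows into R. Note that .import into a
# brand-new table stores every column as TEXT, hence the string
# comparison here.
tbl(con, "mtcars") %>%
  filter(cyl == "4") %>%
  collect()

DBI::dbDisconnect(con)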

r2evans
  • Thank you, I hadn't thought of it outside of R. It has worked very nicely! The fact that collecting the filtered data for plotting is prohibitively slow is another question, I think. – HCAI Feb 04 '21 at 21:54
  • Yes, I understand. SQLite does support keys and indexing; you may want to look into adding non-clustered keys or such to speed up the queries (a sketch of adding an index from R follows these comments). You may also want to consider [DuckDB](https://duckdb.org/) (and [`duckdb`](https://cran.r-project.org/web/packages/duckdb/index.html)!), which is very similar to SQLite3 but claims significant speed differences. Lacking that ... you might also want a larger server-based DBMS, but ... that is also a topic for another question. Glad it worked for you. – r2evans Feb 04 '21 at 23:11
  • Thank you for the suggestions. Can you suggest a reference on adding non-clustered keys from R (or sqlite3 itself)? I'm clearly not googling the right thing. If that doesn't help I will try DuckDB. – HCAI Feb 05 '21 at 09:07
  • I found a resource that told me to do: `create index [Registration State] on pvi([Registration State]);` But this has made a big mess of my database, and R is putting a t\ in front of every entry. Any thoughts are much appreciated. – HCAI Feb 05 '21 at 11:09
  • I don't know what "made a big mess" means. The leading `t\` is new to me, sorry. I typically don't name indices the exact same as the field, but I would not expect that to collide or have a destructive effect. The question of what to index and how to do it is based on the actual data and how you plan to query it. It might help to start a new question asking about that step, and it would be very informative to see representative data (not necessarily volume, and you can obfuscate the real numbers/strings if need be, but ... patterns and sequences in the data should be known). – r2evans Feb 05 '21 at 13:54
  • Thank you, I will open a new question to make things clear. – HCAI Feb 05 '21 at 14:06
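A minimal sketch of adding an index from R, per the discussion above (the pvi table and "Registration State" column are taken from the comments; the database file name is assumed to be the one from the question):

con <- DBI::dbConnect(RSQLite::SQLite(), "FINAL_data.sqlite")

# double-quote the column name because it contains a space, and give
# the index its own name rather than reusing the column name
DBI::dbExecute(con, 'CREATE INDEX idx_reg_state ON pvi("Registration State")')

DBI::dbDisconnect(con)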