I have a .csv file with 105M rows and 30 columns that I would like to query for plotting in an R Shiny app.
It contains alphanumeric data that looks like:
# Example data
df <- data.frame(
  percent        = as.numeric(rep(c("50", "80"), each = 5e2)),
  maskProportion = as.numeric(rep(c("50", "80"), each = 5e2)),
  dose           = runif(1e3),
  origin         = as.factor(rep(c("ABC", "DEF"), each = 5e2)),
  destination    = as.factor(rep(c("XYZ", "GHI"), each = 5e2))
)
write.csv(df, "PassengerData.csv", row.names = FALSE)  # avoid an extra row-name column in the CSV
In the terminal, I have ingested it into an SQLite database as follows:
$ sqlite3 -csv PassengerData.sqlite3 '.import PassengerData.csv df'
which is from:
Creating an SQLite DB in R from an CSV file: why is the DB file 0KB and contains no tables?
So far so good.
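For context, the Shiny app pulls subsets for plotting roughly like this (a minimal sketch assuming DBI/RSQLite; the hard-coded filter values stand in for Shiny inputs):

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "PassengerData.sqlite3")

# The kind of filtered pull used for a single plot; the real
# WHERE clause is built from user selections in the app.
# Values are passed as strings because .import created all columns as TEXT.
sub <- dbGetQuery(
  con,
  'SELECT "percent", dose, origin, destination
     FROM df
    WHERE origin = ? AND destination = ? AND "percent" = ?',
  params = list("ABC", "XYZ", "50")
)

dbDisconnect(con)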
The problem is query speed from R, so I tried indexing the DB back in the terminal.
In sqlite3, I tried creating indexes on percent, maskProportion, origin and destination, following https://data.library.virginia.edu/creating-a-sqlite-database-for-use-with-r/:
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "percent" ON df("percent");'
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "origin" ON df("origin");'
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "destination" ON df("destination");'
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "maskProportion" ON df("maskProportion");'
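To illustrate the growth, the indexes created so far and the current file size can be checked from R (a sketch with DBI/RSQLite and base R's file.size(); file and table names as above):

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "PassengerData.sqlite3")

# Indexes that exist on the imported table
dbGetQuery(con, "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'df'")
dbDisconnect(con)

# Size of the database file on disk, in GB
file.size("PassengerData.sqlite3") / 1e9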
I run out of disk space because the DB grows every time I create an index; for example, after running just the first command the file is already 20 GB. How can I avoid this?