I have 4 billion rows of data in a 12 node redshift cluster. I am successfully able to connect to it with the Rpostgreqsql package and use dplyr to do basic data wrangling.
However, I'd like to do some data reshaping which I'd normally use reshape2 (dcast) or tidyr (spread) to do. I am finding that neither package executes on my database object. I could run "collect" but that would be problematic because that dataframe would be far too big to fit into memory (hence the reason I want to run in DB). My overall goal is to use dcast/spread to make the data wider while creating 0/1 flags in the process. This works like a charm with small samples of data on my machine, but not so well on DB.
Below is my code that I have working. Connecting to DB and doing basic filtering with dplyr. When I try to use tidyr/reshape2, R throws syntax errors that "type not recognized"
Redshift <- src_postgres('dev',
host = 'xxx.aws.com',
port = 5439,
user = "user",
password = "pwd")
### create table reference ###
df <- tbl(Redshift, "df_cj_allact")
# simple and default R commands analyzing data frames
dim(df)
colnames(df)
head(df)
df2 <- df %>% filter(id != '0') %>% arrange(id, timestamp, category) # seems to work!
# 2157398, was 2306109 (6% loss)