
I have 4 billion rows of data in a 12-node Redshift cluster. I am able to connect to it with the RPostgreSQL package and use dplyr to do basic data wrangling.

However, I'd like to do some data reshaping, which I'd normally use reshape2 (dcast) or tidyr (spread) for. I am finding that neither package executes against my database object. I could run `collect()`, but that would be problematic because the resulting data frame would be far too big to fit into memory (hence the reason I want to run in-database). My overall goal is to use dcast/spread to make the data wider while creating 0/1 flags in the process. This works like a charm with small samples of data on my machine, but not so well on the database.

Below is the code I have working: connecting to the database and doing basic filtering with dplyr. When I try to use tidyr/reshape2 on the database object, R throws errors like "type not recognized".

    library(RPostgreSQL)
    library(dplyr)

    Redshift <- src_postgres('dev',
                             host = 'xxx.aws.com',
                             port = 5439,
                             user = "user",
                             password = "pwd")


    ### create table reference ###
    df <- tbl(Redshift, "df_cj_allact")

    # simple and default R commands for inspecting the table
    dim(df)
    colnames(df)
    head(df)

    df2 <- df %>% filter(id != '0') %>% arrange(id, timestamp, category)  # seems to work!
    # 2157398 rows, was 2306109 (6% loss)
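One possible in-database workaround (a sketch, not from the thread): instead of spread()/dcast(), build the 0/1 flag columns with dplyr verbs that dplyr can translate to SQL — `ifelse()` becomes `CASE WHEN` and `max()` becomes `MAX()`, so the whole pipeline can run inside Redshift. It is demonstrated here on a local data frame so it runs standalone; the same pipeline should work unchanged on the `df` tbl above, assuming the category levels are known in advance. The level names ("email", "display") and flag column names are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the remote table; ids and categories are hypothetical
toy <- data.frame(
  id       = c(1L, 1L, 2L),
  category = c("email", "display", "email"),
  stringsAsFactors = FALSE
)

# One row per id, one 0/1 flag per known category level.
# On a tbl() backed by Redshift, ifelse()/max() translate to CASE WHEN / MAX().
wide <- toy %>%
  group_by(id) %>%
  summarise(
    flag_email   = max(ifelse(category == "email", 1L, 0L)),
    flag_display = max(ifelse(category == "display", 1L, 0L))
  )
```

The drawback is that each category level must be spelled out by hand (or generated programmatically), since SQL needs the output columns fixed up front.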
leaRningR909

1 Answer


The tidyr package doesn't support database backends; it can only manipulate in-memory data. dplyr works with database tables as well as with in-memory objects. You could either use a machine with more memory (say, on AWS) and run data.table, or think about splitting your data.
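To illustrate the in-memory route the answer suggests: data.table's `dcast()` can widen long data into per-category flag columns, e.g. with `fun.aggregate = length`, which yields 0/1 values when each id/category pair appears at most once. This is a minimal sketch on made-up data, not the asker's actual table.

```r
library(data.table)

# Toy long-format data; ids and category levels are hypothetical
dt <- data.table(
  id       = c(1L, 1L, 2L, 3L),
  category = c("a", "b", "a", "c")
)

# One row per id, one column per category level; length() counts
# occurrences, and absent combinations fill with 0
wide <- dcast(dt, id ~ category, fun.aggregate = length, value.var = "category")
```

Note that if an id/category pair can repeat, `length` gives counts rather than flags, so you would cap the values at 1 (or aggregate differently) afterwards.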

kismsu
  • Thanks for your comment. Regarding using a machine with larger memory + data.table... this could work. I have an AWS EC2 instance provisioned with plenty of memory. Is there an advantage to using data.table there vs. tidyr? – leaRningR909 Feb 17 '16 at 16:27
  • data.table is designed for very large objects (tens of millions of rows). It has dcast and melt functions optimized for data.table objects, and they work the same as dcast/melt from reshape2. I think the tidyr functions are more logical, but that might be personal preference; however, I'm not sure tidyr works with native data.table objects — it may first convert them to the tbl_df class. – kismsu Feb 17 '16 at 17:01
  • There is an [issue](https://github.com/tidyverse/tidyr/issues/189) related to this. It seems like something that would be natural to have in `tidyr`. – Ian Gow Mar 19 '17 at 19:41