How to retrieve more than 100000 rows from Redshift using R and dplyr

Question

I'm analyzing data from a Redshift database, working in R using a connection per dplyr - which works:

my_db<-src_postgres(host='my-cluster-blahblah.redshift.amazonaws.com', port='5439', dbname='dev',user='me', password='mypw')
mytable <- tbl(my_db, "mytable")

viewstation<-mytable %>%
    filter(stationname=="something")

When I try to turn that output into a data frame, so:

thisdata<-data.frame(viewstation)

I get the error message, Warning message:

Only first 100,000 results retrieved. Use n = -1 to retrieve all.

Where am I supposed to set n?

Note, I would indeed like to subset to fewer than 100000 points, but I cant do that: thisdata<-sample_n(viewstation,100000) generates the error, "Error: Don't know how to sample from objects of class tbl_postgres" — Lucy, Jul 17 '15 at 23:45
A better practice is to do the subset using the WHERE clause in your SELECT query to Redshift. It will prevent the unnecessary transfer of the data over the network and fill up the memory of your machine. — Guy, Jul 18 '15 at 07:46
but i want all that data :) I want all those hundreds of thousands of points. A random select to bring the number down is an OK second best merely. and if possible I'd rather stay within dplyr - so so easy for the use cases where it works.... — Lucy, Jul 18 '15 at 14:32
i know that you can set n in a query managed from rpostgresql directly - it is a parameter to "fetch" command. But how from dplyr can I get to that? — Lucy, Jul 18 '15 at 14:40

score 7 · Answer 1 · edited Jun 20 '20 at 09:12

7

Instead of using

thisdata<-data.frame(viewstation)

use

thisdata <- collect(viewstation)

collect() will pull all the data from the database back into R. As mentioned in the DPLYR::databases vignette:

When working with databases, dplyr tries to be as lazy as possible. It’s lazy in two ways:

It never pulls data back to R unless you explicitly ask for it.

It delays doing any work until the last possible minute, collecting together everything you want to do then sending that to the database in one step.

edited Jun 20 '20 at 09:12

Community

1
1

answered Jul 18 '15 at 10:51

phiver

23,048
14
44
56

ooo @phiver i think it works. followed by `thisdata_df<-data.frame(thisdata)` – Lucy Jul 18 '15 at 14:42

score 0 · Answer 2 · answered Jun 20 '17 at 15:28

For those who are still using dplyr 0.5 (like me).

The parameter n is part of the collect function.

my_db<-src_postgres(host='my-cluster-blahblah.redshift.amazonaws.com', port='5439', dbname='dev',user='me', password='mypw')
mytable <- tbl(my_db, "mytable") %>% collect(n = Inf)

This will get you more than 100.000 rows.

How to retrieve more than 100000 rows from Redshift using R and dplyr

2 Answers2

For those who are still using dplyr 0.5 (like me).