
I am trying to perform a logistic regression in R on my data. I have created all the model variables and have them in place in a table on my Redshift database. Let's refer to this database as 'Database A' and the table as 'Table A'.

Problem Statement

Is it feasible to run a logistic regression on around 2 million records on a laptop with 4 GB of RAM?

What I don't want to do

I don't want to wait for my query to execute and then wait for it to display all the records; I have around 2 million records. I am also not interested in right-clicking and then saving the results as a CSV file, which I think is really time-consuming.

My research and the dplyr package.

I have gone through this blog about connecting R to Amazon Redshift. It talks about establishing a connection through the RJDBC package. I am connecting to Redshift from my personal laptop. For reference, the version command on my laptop outputs the following:

platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          2.5                         
year           2016                        
month          04                          
day            14                          
svn rev        70478                       
language       R                           
version.string R version 3.2.5 (2016-04-14)
nickname       Very, Very Secure Dishes 

I was able to create a connection to Redshift. I used the tbl function to create an R object which points to 'Table A' in Amazon Redshift. The pseudo-code is below:

library(dplyr)   # src_postgres() also requires the RPostgreSQL package

myRedshift <- src_postgres(dbname = "Database A",
                           host = "Host_name",
                           port = Portnumber,
                           user = "XXXX",
                           password = "XXXX")

# a lazy reference to the remote table; no rows are pulled yet
my.data <- tbl(myRedshift, "Table A")

This works fine. I checked the dimensions. They were correct.

What I did next was

I tried to use the tbl_df function to store the values of the my.data object in a data frame in R so that I could perform the logistic regression, but the operation just kept running for more than 50 minutes. I aborted R.

I also tried to chain the results into a data frame as

new.data <- my.data %>% select(*)

But this gave me errors. I have more than 15 columns and I don't want to type out each column's name.
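
A minimal sketch of the direction I was going, assuming the remote tbl created above: select() does not understand SQL's *, and dplyr's everything() helper is one way to keep all the columns without typing their names.

library(dplyr)

# everything() keeps all columns without naming them;
# collect() then pulls the result into a local data frame
new.data <- my.data %>%
  select(everything()) %>%
  collect()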

I searched online and came across SparkR, and it seemed like it could help me. I was following the instructions mentioned in this link, but when I run the .\bin\sparkR command on my Windows cmd terminal, I get an error saying

Access is denied
The system cannot find the file 'C:\Users\Name\AppData\Local\Temp'
The system cannot find the path specified.

How should I rectify this error? What is an efficient method to store the data from my table in Redshift so that I can perform the logistic regression? I know of an UNLOAD function which outputs pipe-delimited files; what should I ask my IT department for in order to use it?
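
For context, a minimal sketch of the UNLOAD call I have in mind, sent through the RJDBC connection described in the blog post; the connection object conn, the S3 bucket/prefix, and the IAM role below are placeholders that IT would need to supply.

library(RJDBC)   # assumes a JDBC connection `conn` to Redshift, as in the blog post

# placeholders: an S3 location we can write to and the IAM role ARN
unload_sql <- "
UNLOAD ('SELECT * FROM \"Table A\"')
TO 's3://my-bucket/table_a_export/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
ALLOWOVERWRITE
"

dbSendUpdate(conn, unload_sql)   # UNLOAD writes pipe-delimited files to the S3 prefix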

Ajay Kumar
  • It seems that you are successfully loading the data frame, which is at odds with the problem statement of "Look for an efficient way to get the data from Table A in Redshift's Database A inside my 'R' environment". It seems your problem is more around the amount of time it takes to run a logistic regression. To answer what appears to be the real question "Is it feasible to do logistic regression with 2 million observations on a laptop with 4GB RAM?", the answer will depend on not just the number of observations, but also the number of variables and the R function being used – SpiritusPrana Jun 15 '16 at 13:04
  • Please could you add (a) the formula that you are using for the regression (b) the results of str(my.data) and (c) the syntax of the R command you are using to run the logistic regression. – SpiritusPrana Jun 15 '16 at 13:09
  • Also, SparkR will help you if you have access to a distributed processing cluster. Spark is designed to efficiently distribute analytical workload across a cluster. When running in local mode on a laptop it won't add much efficiency, and could even be slower than base R functions. – SpiritusPrana Jun 15 '16 at 13:12
  • Sure just give me some more time. I will try getting back to you within 6 hours. – Ajay Kumar Jun 16 '16 at 01:04
  • Thanks for the observation. I confused myself with the interpretation of the `tbl` function. I looked online and the `dplyr::collect` has come to my rescue. The model seems to be working. The `mutate` and `transform` functions are also handy to me. I am using `glmnet` for a penalized logistic regression. – Ajay Kumar Jun 16 '16 at 07:32
  • I will update you about my progress by the end of day – Ajay Kumar Jun 16 '16 at 07:34

1 Answer


So I do have a temporary solution to my problem.

As was highlighted in the comments, the main issue was the time taken, and I was unclear about how to interpret the tbl function.

What helped me use dplyr with Redshift was the collect function. I could then use mutate and transform very easily. Instead of using the base R approach of str, what helps is the glimpse function: it returns whether my.data is a data frame, its dimensions, and the data types of all the variables.
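
As a minimal sketch of what I mean, assuming the remote tbl my.data from the question:

library(dplyr)

# collect() pulls the lazily evaluated remote result into a local data frame
local.data <- my.data %>% collect()

# glimpse() is a compact alternative to str(): it prints the number of
# observations and the type of every column
glimpse(local.data)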

However, collect takes a lot of time to create the predictors. I had 6 predictors in my model, with more than a million rows. I tried to create my regressors as

regressors <- my.data %>%
  filter(required filters) %>%           # placeholder for the actual filter conditions
  select(predictor1, ..., predictor6) %>%
  collect()

But this was taking more than an hour, so I aborted the operation and added the predictors one by one:

predictor1 <- my.data %>%
  filter(required filters) %>%
  select(predictor1) %>%
  collect()

I did this for all 6 predictors, each containing more than 1 million records. The `numeric` predictors took around 5 minutes to be created, but the predictors of type `character` took 10-15 minutes.
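
To put the individually collected columns back together, one option is dplyr's bind_cols; a minimal sketch, assuming predictor1 through predictor6 were collected as above with the same filters:

library(dplyr)

# each predictorN is a one-column data frame returned by collect();
# rows from separate queries only line up if each query returns them in a
# consistent order (e.g. by adding arrange() on a key column)
model.data <- bind_cols(predictor1, predictor2, predictor3,
                        predictor4, predictor5, predictor6)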

For the modeling I am using the glmnet function, which works fine.
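
A minimal sketch of the kind of call I mean, assuming the combined data frame model.data from above also holds a 0/1 outcome column named response (the column name is hypothetical); glmnet needs a numeric matrix, so character predictors go through model.matrix first:

library(glmnet)

# model.matrix() expands character/factor predictors into dummy columns;
# the intercept column is dropped because glmnet adds its own
x <- model.matrix(response ~ ., data = model.data)[, -1]
y <- model.data$response

# cross-validated penalized logistic regression (lasso, glmnet's default alpha = 1)
fit <- cv.glmnet(x, y, family = "binomial")

# predicted probabilities at the lambda selected by cross-validation
p <- predict(fit, newx = x, s = "lambda.min", type = "response")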

If there is a better way to create my predictor matrix, let me know.

Ajay Kumar