
I am trying to perform a logistic regression in R on my data. I have created all the model variables and have them in place in a table on my Redshift database. Let's refer to this database as 'Database A' and the table as 'Table A'.

Problem Statement

Is it feasible to run a logistic regression on around 2 million records on a laptop with 4 GB of RAM?

What I don't want to do

I don't want to wait for my query to execute and then wait for it to display all the records; I have around 2 million records. I am also not interested in right-clicking and then saving the results as a CSV file, which I think is really time-consuming.

My research and the dplyr package.

I have gone through this blog about connecting R to Amazon Redshift. It talks about establishing a connection through the RJDBC package. I am connecting to Redshift from my personal laptop. For reference, the version command on my laptop outputs the following:

platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          2.5                         
year           2016                        
month          04                          
day            14                          
svn rev        70478                       
language       R                           
version.string R version 3.2.5 (2016-04-14)
nickname       Very, Very Secure Dishes 

I was able to create a connection to Redshift. I used the tbl function to create an R object which points to 'Table A' in Amazon Redshift. The pseudo-code is below:

library(dplyr)   # src_postgres() also requires the RPostgreSQL package

myRedshift <- src_postgres(dbname = "Database A",
                           host = "Host_name",
                           port = Portnumber,
                           user = "XXXX",
                           password = "XXXX")

# a lazy reference to the remote table; no rows are pulled yet
my.data <- tbl(myRedshift, "Table A")

This works fine. I checked the dimensions. They were correct.

What I did next was

I tried to use the tbl_df function to store the values of the my.data object in a data frame in R so that I could perform the logistic regression, but the operation just kept running for more than 50 minutes. I aborted R.

I also tried to chain the results into a data frame as

new.data <- my.data %>% select(*)

But this gave me errors. I have more than 15 columns and I don't want to type out each column's name.
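
A minimal sketch of the direction I was going, assuming the remote tbl created above: select() does not understand SQL's *, and dplyr's everything() helper is one way to keep all the columns without typing their names.

library(dplyr)

# everything() keeps all columns without naming them;
# collect() then pulls the result into a local data frame
new.data <- my.data %>%
  select(everything()) %>%
  collect()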

I searched online and came across SparkR, and it seemed like it could help me. I was following the instructions mentioned in this link, but when I run the .\bin\sparkR command on my Windows cmd terminal, I get an error saying

Access is denied
The system cannot find the file 'C:\Users\Name\AppData\Local\Temp'
The system cannot find the path specified.

How should I rectify this error? What is an efficient method to store the data from my table in Redshift so that I can perform the logistic regression? I know of an UNLOAD function which outputs pipe-delimited files; what should I ask my IT department for in order to use it?
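
For context, a minimal sketch of the UNLOAD call I have in mind, sent through the RJDBC connection described in the blog post; the connection object conn, the S3 bucket/prefix, and the IAM role below are placeholders that IT would need to supply.

library(RJDBC)   # assumes a JDBC connection `conn` to Redshift, as in the blog post

# placeholders: an S3 location we can write to and the IAM role ARN
unload_sql <- "
UNLOAD ('SELECT * FROM \"Table A\"')
TO 's3://my-bucket/table_a_export/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
ALLOWOVERWRITE
"

dbSendUpdate(conn, unload_sql)   # UNLOAD writes pipe-delimited files to the S3 prefix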

Ajay Kumar
  • It seems that you are successfully loading the data frame, which is at odds with the problem statement of "Look for an efficient way to get the data from Table A in Redshift's Database A inside my 'R' environment". It seems your problem is more around the amount of time it takes to run a logistic regression. To answer what appears to be the real question "Is it feasible to do logistic regression with 2 million observations on a laptop with 4GB RAM?", the answer will depend on not just the number of observations, but also the number of variables and the R function being used – SpiritusPrana Jun 15 '16 at 13:04
  • Please could you add (a) the formula that you are using for the regression (b) the results of str(my.data) and (c) the syntax of the R command you are using to run the logistic regression. – SpiritusPrana Jun 15 '16 at 13:09
  • Also, SparkR will help you if you have access to a distributed processing cluster. Spark is designed to efficiently distribute analytical workload across a cluster. When running in local mode on a laptop it won't add much efficiency, and could even be slower than base R functions. – SpiritusPrana Jun 15 '16 at 13:12
  • Sure just give me some more time. I will try getting back to you within 6 hours. – Ajay Kumar Jun 16 '16 at 01:04
  • Thanks for the observation. I confused myself with the interpretation of the `tbl` function. I looked online and the `dplyr::collect` has come to my rescue. The model seems to be working. The `mutate` and `transform` functions are also handy to me. I am using `glmnet` for a penalized logistic regression. – Ajay Kumar Jun 16 '16 at 07:32
  • I will update you about my progress by the end of day – Ajay Kumar Jun 16 '16 at 07:34

1 Answer


So I do have a temporary solution to my problem.

As was highlighted in the comments, the main issue was the time taken, and I was unclear about how to interpret the tbl function.

What helped me use dplyr with Redshift was the collect function. I could then use mutate and transform very easily. Instead of using the base R approach of str, what helps is the glimpse function: it returns whether my.data is a data frame, its dimensions, and the data types of all the variables.
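
As a minimal sketch of what I mean, assuming the remote tbl my.data from the question:

library(dplyr)

# collect() pulls the lazily evaluated remote result into a local data frame
local.data <- my.data %>% collect()

# glimpse() is a compact alternative to str(): it prints the number of
# observations and the type of every column
glimpse(local.data)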

However, collect takes a lot of time to create the predictors. I had 6 predictors in my model, with more than a million rows. I tried to create my regressors as

regressors <- my.data %>%
  filter(required filters) %>%           # placeholder for the actual filter conditions
  select(predictor1, ..., predictor6) %>%
  collect()

But this was taking more than an hour, so I aborted the operation and added the predictors one by one:

predictor1 <- my.data %>%
  filter(required filters) %>%
  select(predictor1) %>%
  collect()

I did this for all 6 predictors, each containing more than 1 million records. The `numeric` predictors took around 5 minutes to be created, but the predictors of type `character` took 10-15 minutes.
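
To put the individually collected columns back together, one option is dplyr's bind_cols; a minimal sketch, assuming predictor1 through predictor6 were collected as above with the same filters:

library(dplyr)

# each predictorN is a one-column data frame returned by collect();
# rows from separate queries only line up if each query returns them in a
# consistent order (e.g. by adding arrange() on a key column)
model.data <- bind_cols(predictor1, predictor2, predictor3,
                        predictor4, predictor5, predictor6)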

For the modeling I am using the glmnet function, which works fine.
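
A minimal sketch of the kind of call I mean, assuming the combined data frame model.data from above also holds a 0/1 outcome column named response (the column name is hypothetical); glmnet needs a numeric matrix, so character predictors go through model.matrix first:

library(glmnet)

# model.matrix() expands character/factor predictors into dummy columns;
# the intercept column is dropped because glmnet adds its own
x <- model.matrix(response ~ ., data = model.data)[, -1]
y <- model.data$response

# cross-validated penalized logistic regression (lasso, glmnet's default alpha = 1)
fit <- cv.glmnet(x, y, family = "binomial")

# predicted probabilities at the lambda selected by cross-validation
p <- predict(fit, newx = x, s = "lambda.min", type = "response")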

If there is a better way to create my predictor matrix, let me know.

Ajay Kumar