I am trying to perform a logistic regression in R on my data.
I have created all the model variables and have them in place
in a table in my Redshift database.
Let's refer to this database as 'Database A' and the table as
'Table A'.
Problem Statement
Is it feasible to run a logistic regression on roughly 2 million records on a laptop with 4 GB of RAM?
What I don't want to do
I don't want to wait for my query to execute and then wait again while it displays all of the roughly 2 million records, nor do I want to right-click and save the results as a CSV file. That approach is really time-consuming.
My research and the dplyr package
I have gone through this blog about connecting R to Amazon Redshift.
It talks about establishing a connection through the RJDBC
package.
I am connecting to Redshift from my personal laptop. For reference,
the version command in my R session outputs the following.
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 2.5
year 2016
month 04
day 14
svn rev 70478
language R
version.string R version 3.2.5 (2016-04-14)
nickname Very, Very Secure Dishes
I was able to create a connection to Redshift. I used the tbl
function to create an R object that points to my 'Table A' in Amazon Redshift.
The pseudo-code is below:
myRedshift <- src_postgres(dbname = 'Database A',
                           host = 'Host_name',
                           port = Portnumber,
                           user = "XXXX",
                           password = "XXXX")
my.data <- tbl(myRedshift, "Table A")
This works fine. I checked the dimensions. They were correct.
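From what I understand, the tbl object is lazy: dplyr translates verbs into SQL and runs them inside Redshift, so checks like the dimension check above don't pull the data down. A sketch of what I mean by pushing work to the server (outcome_flag is a hypothetical stand-in for one of my columns):

```r
library(dplyr)

# dplyr builds SQL from these verbs and executes it in Redshift;
# only the one-row summary travels over the network.
row.count <- my.data %>%
  summarise(n = n()) %>%
  collect()

# An aggregate over a hypothetical outcome column, also computed server-side
event.rate <- my.data %>%
  summarise(rate = mean(outcome_flag)) %>%
  collect()
```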
What I did next was
try to use the tbl_df
function to store the values of the
my.data
object in a local R data frame so I could perform logistic regression.
But the operation kept running for more than 50 minutes, so I aborted R.
I also tried to chain the results into a data frame with
new.data <- my.data %>% select(*)
But this gave me errors. I have more than 15 columns and I don't want to type out each column's name.
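From what I've read, dplyr's collect() should pull every column without my having to type the names, though with ~2 million rows it may still be slow on 4 GB of RAM. A sketch of what I believe the idiomatic calls look like (assuming the myRedshift connection above):

```r
library(dplyr)

# collect() executes the remote query and returns all columns
# as a local data frame -- no column names need to be listed.
new.data <- my.data %>% collect()

# Pulling only a random sample instead would keep memory use down;
# sql() passes raw Redshift SQL through the same connection.
sample.data <- tbl(myRedshift,
                   sql('SELECT * FROM "Table A" ORDER BY RANDOM() LIMIT 100000')) %>%
  collect()
```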
I searched online and came across SparkR,
which seemed like it could help me.
I was following the instructions mentioned in this link. But when
I run the .\bin\sparkR
command in my Windows cmd
terminal, I get an
error saying
Access is denied
The system cannot find the file 'C:\Users\Name\AppData\Local\Temp'
The system cannot find the path specified.
How should I rectify this error?
What is an efficient method to get the data from my Redshift table into R
so that I can perform logistic regression?
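For context on what I ultimately want to run: once the data is in a local data frame, I expect base R's glm() with family = binomial() to fit the model. A self-contained sketch on simulated data (x1, x2, and outcome are hypothetical stand-ins for my real variables):

```r
# Simulate a small stand-in for the collected data
set.seed(42)
n <- 10000
sim.data <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# True model: logit(p) = -1 + 2*x1 (x2 has no effect)
p <- 1 / (1 + exp(-(-1 + 2 * sim.data$x1)))
sim.data$outcome <- rbinom(n, 1, p)

# Fit the logistic regression with base R
fit <- glm(outcome ~ x1 + x2, data = sim.data, family = binomial())
coef(fit)
```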
I know of an UNLOAD
command which outputs pipe-delimited files.
What should I ask my IT department for in order to use UNLOAD?
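As I understand the Redshift docs, UNLOAD writes its pipe-delimited output to an S3 bucket, so I would presumably need IT to provide a bucket plus credentials (or an IAM role) with write access to it. A sketch of the statement I think I would send over the existing connection (the bucket name and keys are placeholders, and conn stands for a DBI/RJDBC connection like the one above):

```r
library(RJDBC)

# UNLOAD exports query results from Redshift to S3 as pipe-delimited
# files. The bucket and credentials are placeholders IT would supply.
unload.sql <- "
  UNLOAD ('SELECT * FROM \"Table A\"')
  TO 's3://your-bucket/table_a_'
  CREDENTIALS 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX'
  DELIMITER '|'
  ALLOWOVERWRITE;
"

# RJDBC's dbSendUpdate() runs statements that return no result set
dbSendUpdate(conn, unload.sql)
```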