
I have data in a Cassandra database with one continuous dependent variable and around 100 discrete independent variables. Data is added to the database from various servers, and I will get millions of data points each day.

On any given day, I plan to predict the value of the dependent variable from the values of the independent variables, using the last 3 days of data. I did some research and concluded that linear regression is the best choice for me (is it?). I am thinking of using Python or R as the programming tool, since both have existing implementations.

Now my questions are:

  1. I will have around 3 million samples to train the model on every day. What is the best way to retrieve the data from the database and train the model? What are my options in terms of implementation?
  2. Can I reuse the previously trained model weights for the next day's training? If yes, what are my options?
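
To make question 2 concrete, here is a minimal sketch of what I mean by reusing weights. It assumes plain mini-batch gradient descent on a least-squares linear model rather than a closed-form fit, so the weights from one day can simply be the starting point for the next. Everything here is illustrative: `fake_day` is a hypothetical stand-in for paging rows out of Cassandra, and the feature count and learning rate are made up. As far as I can tell, scikit-learn's `SGDRegressor.partial_fit` offers a similar incremental update.

```python
import random


def sgd_update(weights, bias, batch, lr):
    """One mini-batch gradient step for least-squares linear regression."""
    n = len(batch)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in batch:
        # Prediction error for this sample under the current weights.
        err = sum(w * xi for w, xi in zip(weights, x)) + bias - y
        for j, xi in enumerate(x):
            grad_w[j] += err * xi / n
        grad_b += err / n
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b


def train_on_stream(batches, weights, bias, lr=0.2):
    """Consume batches one at a time so the full day never sits in memory."""
    for batch in batches:
        weights, bias = sgd_update(weights, bias, batch, lr)
    return weights, bias


def fake_day(n_batches, batch_size, true_w, true_b, rng):
    """Hypothetical stand-in for reading one day of rows in pages."""
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = [float(rng.randint(0, 1)) for _ in true_w]  # discrete features
            y = sum(w * xi for w, xi in zip(true_w, x)) + true_b
            batch.append((x, y))
        yield batch


rng = random.Random(0)
true_w, true_b = [2.0, -1.0, 0.5], 3.0  # made-up ground truth for the demo

# Start from zeros on the first day, then carry the weights forward.
weights, bias = [0.0, 0.0, 0.0], 0.0
for day in range(3):
    batches = fake_day(200, 32, true_w, true_b, rng)
    weights, bias = train_on_stream(batches, weights, bias)

print(weights, bias)
```

The point of the sketch is only the last loop: each day's call receives the previous day's `weights` and `bias` instead of fresh zeros. Whether this beats retraining from scratch on the 3-day window is part of what I am asking.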

Thanks in advance.

  • Your question is quite broad. If you have a specific thing in R or Python you are struggling with, feel free to post another question. – Paul Hiemstra Jul 07 '16 at 16:26
  • I suggest checking the lm function in R. It is the built-in function for regression. For fetching the data, please check this link: http://stackoverflow.com/questions/21994077/how-to-read-data-from-cassandra-with-r – Sahil Desai Sep 15 '16 at 08:32

0 Answers