I'm trying to build a predictive model with a customer database.

I have a dataset with 3,000 customers. Each customer has 300 observations and 20 variables (including the dependent variable) in a test dataset. I also have a score dataset that has 50 observations with 19 variables (excluding the dependent variable) for each unique customer ID. The test dataset is in a separate file, with each customer identified by a unique ID variable; similarly, the score dataset is identified by a unique ID variable.

I'm developing a randomForest-based predictive model. Below is a sample for a single customer. I'm not sure how I could automatically fit the model for each customer, predict, and store the models efficiently as well.

    install.packages("randomForest")  # the package name must be quoted
    library(randomForest)
    sales <- read.csv("C:/rdata/test.csv", header = TRUE)
    sales_score <- read.csv("C:/rdata/score.csv", header = TRUE)

    ## RandomForest for a single customer

    sales.rf <- randomForest(Sales ~ ., ntree = 500, data = sales, importance = TRUE)
    sales.rf.test <- predict(sales.rf, sales_score)

I have very good familiarity with SAS and am beginning to learn R. For SAS programmers, many SAS procedures come with BY-group processing, for example:

    proc gam data = test;
    by id;
    model y = x1  x2 x3;
    score data = test  out = pred;
    run;

This SAS program would develop a GAM model for each unique ID and apply each model to the test set for that ID. Is there an R equivalent?

I would greatly appreciate any examples or thoughts.

Thanks so much

forecaster
  • The only "non-obvious" command you should need for this is `split`. Beyond that, this is nothing more than a `for` loop. Just be sure to pre-allocate the lists that will hold the models and predicted values for each customer. (Also, with large numbers of variables, avoid using the formula interface to `randomForest`. It's often much slower.) – joran Mar 23 '14 at 02:49
  • Jordan, thank you. Can you please elaborate on avoiding the formula interface? – forecaster Mar 23 '14 at 02:55
  • Since you're new to R, you should be studying the documentation. The note about the formula interface is right there in the Note section. Read about the arguments `x` and `y` (right at the top of the docs) for alternatives. – joran Mar 23 '14 at 03:10
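
A minimal sketch of the pattern described in the comments above: `split` the rows by customer, pre-allocate lists, then loop. It uses simulated data and `lm` as a stand-in for `randomForest` so that it runs in base R. The column names `customer_id`, `x1`, `x2`, and `Sales` are assumptions here, not taken from the actual files.

```r
set.seed(1)
# Simulated stand-ins for the training and score files (hypothetical columns).
sales <- data.frame(
  customer_id = rep(1:3, each = 30),
  x1 = rnorm(90),
  x2 = rnorm(90)
)
sales$Sales <- 2 * sales$x1 - sales$x2 + rnorm(90, sd = 0.1)
sales_score <- data.frame(
  customer_id = rep(1:3, each = 5),
  x1 = rnorm(15),
  x2 = rnorm(15)
)

# One data frame per customer, named by customer_id.
train_by_id <- split(sales, sales$customer_id)
score_by_id <- split(sales_score, sales_score$customer_id)

# Pre-allocate named lists for the fitted models and the predictions.
ids <- names(score_by_id)
models <- setNames(vector("list", length(ids)), ids)
preds <- setNames(vector("list", length(ids)), ids)

for (id in ids) {
  # lm is only a stand-in here; swap in randomForest(...) for the real model.
  models[[id]] <- lm(Sales ~ x1 + x2, data = train_by_id[[id]])
  preds[[id]] <- unname(predict(models[[id]], newdata = score_by_id[[id]]))
}

# Put the predictions back in the original row order of sales_score.
sales_score$pred <- unsplit(preds, sales_score$customer_id)
```

Because `models` is an ordinary named list, something like `saveRDS(models, "models.rds")` (file name hypothetical) should be enough to keep all fitted models in one file.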

1 Answer

Assuming your sales dataset is 3,000 * 300 = 900,000 rows and both dataframes have a customer_id column, you can do something like:

    pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
    # pred_groups is now a list whose names are the customer_ids and whose
    # elements are integer vectors of row numbers. Now iterate over each
    # customer and make predictions on the score set.
    preds <- unsplit(structure(lapply(names(pred_groups), function(customer_id) {
      # Train using only observations for this customer.
      # Note we are comparing character to integer; R coerces the integer to
      # character, which gives the correct answer for integer IDs.
      train_rows <- sales$customer_id == customer_id
      sales.rf <- randomForest(Sales ~ ., ntree = 500,
                               data = sales[train_rows, ], importance = TRUE)

      # Now make predictions only for this customer.
      predict(sales.rf, sales_score[pred_groups[[customer_id]], ])
    }), .Names = names(pred_groups)), sales_score$customer_id)

    print(head(preds)) # Should now be a vector of predicted scores with length
      # equal to the number of rows in the score set.

Edit: Per @joran, here is a solution with a `for` loop:

    pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
    # Pre-allocate the prediction vector, one slot per row of the score set.
    preds <- numeric(nrow(sales_score))
    for (customer_id in names(pred_groups)) {
      train_rows <- sales$customer_id == customer_id
      sales.rf <- randomForest(Sales ~ ., ntree = 500,
                               data = sales[train_rows, ], importance = TRUE)
      pred_rows <- pred_groups[[customer_id]]
      preds[pred_rows] <- predict(sales.rf, sales_score[pred_rows, ])
    }
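
Following the comment about the formula interface, here is a sketch of the same loop using the `x`/`y` arguments of `randomForest` and keeping each fitted model in a list. It assumes the randomForest package is installed and uses simulated data with hypothetical predictor names `x1` and `x2`. Dropping `customer_id` from the predictors and lowering `ntree` to 100 are choices made for this sketch, not part of the answer above.

```r
library(randomForest)

set.seed(42)
# Simulated stand-ins for the two files (hypothetical columns).
sales <- data.frame(customer_id = rep(1:2, each = 40),
                    x1 = runif(80), x2 = runif(80))
sales$Sales <- 3 * sales$x1 + sales$x2 + rnorm(80, sd = 0.1)
sales_score <- data.frame(customer_id = rep(1:2, each = 5),
                          x1 = runif(10), x2 = runif(10))

# Predictor columns: everything except the response and the ID.
predictors <- setdiff(names(sales), c("Sales", "customer_id"))
pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)

# Pre-allocate storage for the models and the predictions.
models <- setNames(vector("list", length(pred_groups)), names(pred_groups))
preds <- numeric(nrow(sales_score))

for (id in names(pred_groups)) {
  train_rows <- as.character(sales$customer_id) == id
  # x/y interface: pass the predictor columns and the response directly.
  models[[id]] <- randomForest(x = sales[train_rows, predictors],
                               y = sales$Sales[train_rows],
                               ntree = 100)
  pred_rows <- pred_groups[[id]]
  preds[pred_rows] <- predict(models[[id]],
                              sales_score[pred_rows, predictors])
}
```

Each model stays available afterwards as `models[["1"]]`, `models[["2"]]`, and so on, so it can be inspected or reused without refitting.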
Robert Krzyzanowski
  • It is not unnecessary. Since we are in an `lapply`, using `<-` would not modify `preds` at all. Your misconception is that `<<-` is global variable assignment and thus discouraged; this is wrong. If you define your variable in a parent scope, it only modifies that variable. Wrapping all of this code in a `function()` would not assign a global `preds` variable. Alternatively, you can replace `lapply` with a `for` or use `eval.parent`; note that `assign` does not work because it does not replace scope. You can even use `do.call` on the parent environment. In any case, your decision is draconian. – Robert Krzyzanowski Mar 23 '14 at 03:31
  • Rules should only be followed when it is proper, but it is important to know when to break them. For example, it is actually impossible to do anything with reference classes usefully without `<<-`! – Robert Krzyzanowski Mar 23 '14 at 03:32
  • You're assuming a lot about my knowledge and beliefs about `<<-`. I would not choose to use lapply in this circumstance, so my statement that it is not necessary is correct. For a task like this, the speed difference between lapply and a simple for loop will likely pale in comparison to fitting the model, and the for loop code will be much clearer and more readable. – joran Mar 23 '14 at 03:36
  • @joran Point taken. I have added a version without `<<-` and a version with a `for` loop for completeness's sake. – Robert Krzyzanowski Mar 23 '14 at 03:44
  • Oops, with regard to the character vs. numeric conversion, I forgot that if you have a `customer_id` over 100,000, this could be an issue! http://stackoverflow.com/questions/18964562/why-does-1-99-999-1-99-999-in-r-but-100-000-100-000 – Robert Krzyzanowski Mar 23 '14 at 04:18
  • @RobertKrzyzanowski the program works perfectly. Thanks so much. – forecaster Mar 23 '14 at 15:42
  • @RobertKrzyzanowski I only have 3,000 unique customer_ids; however, each customer_id has 300 observations, so 3,000 * 300 = 900,000 rows. Would this be a problem? Thanks – forecaster Mar 25 '14 at 22:33
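
On the `customer_id` concern raised in the comments above: as far as the linked question describes it, the issue is how a number gets converted to a string during comparison, so it would depend on the ID values rather than on the row count. A sketch, assuming numeric IDs, that converts both ID columns to character the same way before comparing them. `id_key` is a hypothetical helper name, not a base R function.

```r
# Convert IDs to character without scientific notation so that numeric and
# character IDs compare consistently. id_key is a hypothetical helper.
id_key <- function(x) {
  if (is.numeric(x)) {
    format(x, scientific = FALSE, trim = TRUE)
  } else {
    as.character(x)
  }
}

train_ids <- c(99999, 100000, 100001)        # stored as doubles
score_ids <- c("99999", "100000", "100001")  # stored as character

# Direct comparison relies on implicit number-to-string coercion.
direct <- train_ids == score_ids

# Comparing normalized keys avoids depending on that coercion.
safe <- id_key(train_ids) == id_key(score_ids)

print(direct)
print(safe)
```

In the loops above, this would amount to comparing `id_key(sales$customer_id)` against the group name instead of comparing the raw column.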