
I'm used to Python and JS, and pretty new to R, but enjoying it for data analysis. I was looking to create a new field in my data frame, based on some if/else logic, and tried to do it in a standard/procedural way:

for (i in 1:nrow(df)) {
  if (is.na(df$First_Payment_date[i]) == TRUE) {
    df$User_status[i] = "User never paid"
  } else if (df$Payment_Date[i] >= df$First_Payment_date[i]) {
    df$User_status[i] = "Paying user"
  } else if (df$Payment_Date[i] < df$First_Payment_date[i]) {
    df$User_status[i] = "Attempt before first payment"
  } else {
    df$User_status[i] = "Error"
  }
}

But it was CRAZY slow. I tried running this on a data frame of ~3 million rows, and it took way, way too long. Any tips on the "R" way of doing this?

Note that the df$Payment_Date and df$First_Payment_date fields are formatted as dates.

yashap
  • If you have 3 million rows, you might think about reducing the data in a shell before bringing it into R. Also, `for` loops in R are best sped up by pre-allocating an empty vector to hold the result (a sketch of that idea follows these comments). Last, you could probably use one or two `ifelse` statements and a subset for this. Summary: it's best to split up the 3 million rows. – Rich Scriven Apr 12 '14 at 07:07
  • Try the `data.table` package; it is 100x faster than the built-in `data.frame`. – Randy Lai Apr 12 '14 at 07:17
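
For reference, here is a minimal sketch of the pre-allocation idea from the first comment, assuming the same `df` and column names as in the question: the per-row logic is unchanged, but the result is built in a pre-allocated character vector and attached to the data frame once.

status <- character(nrow(df))            # pre-allocate the result vector
for (i in seq_len(nrow(df))) {
  if (is.na(df$First_Payment_date[i])) {
    status[i] <- "User never paid"
  } else if (df$Payment_Date[i] >= df$First_Payment_date[i]) {
    status[i] <- "Paying user"
  } else if (df$Payment_Date[i] < df$First_Payment_date[i]) {
    status[i] <- "Attempt before first payment"
  } else {
    status[i] <- "Error"
  }
}
df$User_status <- status                 # attach the finished column once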

4 Answers


If you initialize to "Error" and then overwrite for the enumerated conditions using logical indexing, this should be much faster. Those if(){}else{} statements for every row are killing you.

df$User_status <- "Error"
df$User_status[ is.na(df$First_Payment_date) ] <- "User never paid"
df$User_status[ df$Payment_Date >= df$First_Payment_date ] <- "Paying user"
df$User_status[ df$Payment_Date < df$First_Payment_date ] <- "Attempt before first payment"
IRTFM
  • you should remove the `[i]` – Randy Lai Apr 12 '14 at 07:45
  • Same line of thinking I was using. Except this is much better. Very nice. – Rich Scriven Apr 12 '14 at 07:50
  • Another potential problem: if `First_Payment_date` is `NA`, the second and third line will die. – Randy Lai Apr 12 '14 at 07:58
  • @RandyLai If that were true (although I don't think it is true) then at least half of the responsibility lies with the OP for not posting a test dataset. When I tested on a reduced version of your dataset, I saw no evidence of failure. – IRTFM Apr 12 '14 at 19:11

I am benchmarking data.frame and data.table on a relatively large dataset.

First we generate some data.

set.seed(1234)
library(data.table)
df = data.frame(First_Payment_date = sample(c(NA, 1:100), 1000000, replace = TRUE),
                Payment_Date = sample(1:100, 1000000, replace = TRUE))
dt = data.table(df)

Then set up the benchmark. I am comparing @BondedDust's answer with its data.table equivalent; I have slightly modified (debugged) his code.

library(microbenchmark)

test_df = function(){
    df$User_status <- "Error"
    df$User_status[ is.na(df$First_Payment_date) ] <- "User never paid"
    df$User_status[ df$Payment_Date >= df$First_Payment_date ] <- "Paying user"
    df$User_status[ df$Payment_Date < df$First_Payment_date ] <- "Attempt before first payment"
}

test_dt = function(){
    dt[, User_status := "Error"]
    dt[is.na(First_Payment_date), User_status := "User never paid"]
    dt[Payment_Date >= First_Payment_date, User_status := "Paying user"]
    dt[Payment_Date < First_Payment_date, User_status := "Attempt before first payment"]
}

microbenchmark(test_df(), test_dt(), times=10)

The result: data.table is about 4x faster than data.frame on the generated 1-million-row data.

> microbenchmark(test_df(), test_dt(), times=10)
Unit: milliseconds
      expr       min        lq    median        uq       max neval
 test_df() 247.29182 256.69067 287.89768 319.34873 330.33915    10
 test_dt()  66.74265  69.42574  70.27826  72.93969  80.89847    10

Note

data.frame is faster than data.table for small datasets (say, 10,000 rows).
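
If you want to check that crossover on your own machine, one rough way (a sketch, not part of the original benchmark; `n` is just an illustrative size) is to regenerate the data at a smaller `n` and re-run the same two functions defined above:

n <- 10000                                  # illustrative small size
df <- data.frame(First_Payment_date = sample(c(NA, 1:100), n, replace = TRUE),
                 Payment_Date = sample(1:100, n, replace = TRUE))
dt <- data.table(df)
microbenchmark(test_df(), test_dt(), times = 10)   # reuses test_df()/test_dt() from above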

Randy Lai
  • Great answer! I almost accepted @bondeddust's answer, but went with this one instead, as it cleaned up the bugs and added a benchmark. Note that even with 3 million rows, this implementation on a data frame is still really fast - going to a data table would likely only be necessary if you've got a really large data set. – yashap Apr 12 '14 at 18:08
  • @yashap: No complaints about choosing this as the better answer, but I do not think the `!is.na(df$First_Payment_date) &` clause is needed. Running without that unnecessary vector computation (x2) and looking at datasets up to 10,000 rows, the `df` solution is faster. And even if I go to 1MM rows, the dt solution is only about twice as fast, which is a significantly lower degree of improvement than typical dt comparisons. Most of the time is still spent building those logical vectors. – IRTFM Apr 12 '14 at 19:20
  • I just did a more extensive test; BondedDust was right, `!is.na(df$First_Payment_date) &` is not necessary in this case. It is necessary if one applies the subset to the `data.frame`: `df[df$Payment_Date < df$First_Payment_date, 'User_status'] <- "Attempt before first payment"`. – Randy Lai Apr 12 '14 at 23:39

I'm not certain that this will speed it up a lot, but you should see some improvement over the for loop you had before. The `else`s aren't really necessary under these conditions, as long as the NA rows are guarded against.

Also, R has functions that act as for loops and other types of loops; see ?apply.

Give this a shot, see how it works. I can't test it since we don't have your data.

df$User_status <- rep("Error", nrow(df))
      ## allocate a vector, fill it with "Error"

df$User_status <- sapply(seq(nrow(df)), function(i){

    ## work on a copy of the current value and return it,
    ## so sapply() can collect the results into the new column
    status <- df$User_status[i]

    if(is.na(df$First_Payment_date[i])){
      status <- gsub("Error", "User never paid", status) }

    ## the !is.na() guards keep if() from failing on NA comparisons
    if(!is.na(df$First_Payment_date[i]) && df$Payment_Date[i] >= df$First_Payment_date[i]){
      status <- gsub("Error", "Paying user", status) }

    if(!is.na(df$First_Payment_date[i]) && df$Payment_Date[i] < df$First_Payment_date[i]){
      status <- gsub("Error", "Attempt before first payment", status) }

    status
    })
Rich Scriven

The usual way to handle this sort of thing is via `ifelse`.

df$User_status <- with(df,
    ifelse(is.na(First_Payment_date), "User never paid",
    ifelse(Payment_Date >= First_Payment_date, "Paying user",
    ifelse(Payment_Date < First_Payment_date, "Attempt before first payment",
    "Error"))))
Hong Ooi