
I have a large data frame in which I am multiplying two columns together to get another column. At first I was running a for-loop, like so:

for(i in 1:nrow(df)){
    df$new_column[i] <- df$column1[i] * df$column2[i]
}

but this takes like 9 days.

Another alternative was plyr, and I actually might be using the variables incorrectly:

new_df <- ddply(df, .(column1,column2), transform, new_column = column1 * column2)

but this is also taking forever.

divibisan
Doug
  • What's wrong with `df$new_column <- df$column1 * df$column2`? How big is your data frame? – Blue Magister Sep 10 '12 at 18:42
  • 400000 rows should be quick even with plyr... but if you're just trying to multiply two columns @BlueMagister has the best solution. Even still, a for loop version shouldn't take 9 days... – Justin Sep 10 '12 at 18:44
  • well. the 9 days was a slight exaggeration, forgot to mention I was jk-ing there – Doug Sep 10 '12 at 18:45
  • Most operations in `R` are vectorized, so you can multiply vectors by vectors and it will multiply entries of the same index together. The problem with the for-loop is that `R` creates a new data frame for every iteration of the loop. The solution I suggested creates just one new data frame instead of 400K new data frames. – Blue Magister Sep 10 '12 at 18:48
  • yes, it is beautiful. this is almost as good as sliced bread – Doug Sep 10 '12 at 18:48
  • "time-efficient" and "plyr" are not generally used in the same sentence. If speed is your goal, you should look at the `data.table` package. (... although 400K x 7 is a tiny dataset these days, and ordinary functions as you have been offered in the answers below should suffice.) – IRTFM Sep 10 '12 at 18:57
  • The reason plyr is so slow here is that you are unnecessarily grouping over column1 and column2. This creates groups for every unique combination of those columns. ddply is not required; data.table will create the new column efficiently. – mnel Sep 10 '12 at 20:11
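Blue Magister's vectorization point, sketched on a toy data frame (the `column1`/`column2` names are assumed from the question):

```r
# Toy data; in the real case df has ~400K rows
df <- data.frame(column1 = 1:5, column2 = seq(10, 50, by = 10))
# One vectorized multiply replaces the whole for-loop
df$new_column <- df$column1 * df$column2
print(df$new_column)
#> [1]  10  40  90 160 250
```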

5 Answers


As Blue Magister said in comments,

df$new_column <- df$column1 * df$column2

should work just fine. Of course, we can never know for sure without an example of the data.

Sacha Epskamp
  • Even more beautiful, but essentially the same: `df$new_column <- with(df, column1 * column2)` – IRTFM Sep 10 '12 at 18:59
  • @DWin that seems a little odd to use `with()` but then do an assign via `$<-`. Anything wrong with `within()` or `transform()?` – Gavin Simpson Sep 10 '12 at 19:07
  • If you are assigning to just a single (new or otherwise) column, I think you need to use `with` instead of `within`, because `within` will return the entire data.frame. That's my understanding, anyway. (And testing confirms that `dat$new <- within(dat, old*2)` will make a messy copy of a nested data frame within the `dat` object.) – IRTFM Sep 10 '12 at 19:40
  • Yes should work fine for the 20MB dataset (400k x 7 rows). But it will copy the whole 20MB, so don't repeat it too much. For example, adding 10 columns using `$<-` each time needs 400MB (`400e3*sum(7:17)*8/1024^2`). Adding 50 churns through 5GB (`400e3*sum(7:57)*8/1024^2`) to get to a 400k x 57 result (just 173MB). Just to be aware. – Matt Dowle Sep 11 '12 at 22:23
  • So, add columns in bulk if you can, not one by one, since that can bite even at the 20MB size. Just in case that was going to be your next step. Or, use `:=`, which doesn't copy the whole 20MB. – Matt Dowle Sep 11 '12 at 22:29
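One base-R way to follow Matt Dowle's bulk-add advice: build the derived columns first and bind them in a single step (a sketch with assumed column names; `prod_col` and `sum_col` are illustrative):

```r
df <- data.frame(column1 = 1:5, column2 = 6:10)
# Build all derived columns first, then bind once:
# one cbind() copy instead of one full copy per `$<-` assignment
new_cols <- data.frame(prod_col = df$column1 * df$column2,
                       sum_col  = df$column1 + df$column2)
df <- cbind(df, new_cols)
print(names(df))
#> [1] "column1"  "column2"  "prod_col" "sum_col"
```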

A data.table solution will avoid lots of internal copying while having the advantage of not spattering the code with `$`.

 library(data.table)
 DT <- data.table(df)
 DT[, new := column1 * column2]
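A self-contained sketch of the same idea on toy data (column names assumed; `:=` adds the column by reference rather than copying `DT`):

```r
library(data.table)  # assumes data.table is installed
DT <- data.table(column1 = 1:3, column2 = c(10, 20, 30))
DT[, new := column1 * column2]  # modifies DT in place
print(DT$new)
#> [1] 10 40 90
```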
mnel

A minor, somewhat less efficient, variation on Sacha's answer is to use transform() or within():

df <- transform(df, new = column1 * column2)

or

df <- within(df, new <- column1 * column2)

(I hate spattering my user code with $.)

Gavin Simpson
  • Why would this be less efficient? Is `$<-` optimized not to copy the data frame? – Gabor Csardi Sep 10 '12 at 19:01
  • There have been some recent optimisations of certain ops on data frames, but I don't recall if that is one of them. I assume this will be slightly less efficient because `transform()` and `within()` contain quite a few additional lines of R code on top of the one that evaluates the expression. – Gavin Simpson Sep 10 '12 at 19:05
  • @GaborCsardi Unfortunately, `$<-` is not optimized not to copy in R; it copies the _entire_ `df`. As do `transform` and `within`; they copy the _entire_ `df` too. This is why `data.table` introduced `:=`, to allow assignment by reference, as demo'd in mnel's answer. – Matt Dowle Sep 11 '12 at 21:44
  • @GavinSimpson All solutions other than `:=` copy the entire `df`, so therefore don't scale. The extra lines of code in `transform` and `within` are tiny in comparison. Unless the call is looped, such as in `transform`-by-group, but then it's even more dominated by all the copies. If `transform` _didn't_ copy _then_ the number of lines would come into it. – Matt Dowle Sep 11 '12 at 21:57
  • @MatthewDowle This is hardly a huge problem, and just using base R solutions properly will save the OP vast amounts of time. I agree moving to data.table for these things will pay off handsomely, but it is an extra set of functions & syntax to master. I don't need to be sold on the data.table goodness; I already sing from that hymn book. I just need to spend some time integrating it into my workflow a bit more so the syntax sticks. – Gavin Simpson Sep 11 '12 at 22:01
  • @GavinSimpson Sure, I was just answering the open points in your and Gabor's comments above. And I also had in mind other people reading this question who might have larger datasets or want to add more than 1 column. – Matt Dowle Sep 11 '12 at 22:50
  • @MatthewDowle: well, theoretically `$<-.data.table` _could_ be optimized not to copy the whole data frame in this case. All I was asking, whether this was done, or not. `data.table` has a lot of nice features indeed. If only the syntax was more R-like... – Gabor Csardi Sep 12 '12 at 03:42
  • @GaborCsardi Oh, `$<-.data.table`? No, that's not possible to not copy (as a mere package author), without a change to R itself. R itself makes the copy when dispatching the `$<-` method, iiuc. If R can be changed is the burning question. In the meantime I like data.table syntax, obviously. It's more like SQL than R. Some have said its syntax is _trivial_, which is nice to hear. – Matt Dowle Sep 12 '12 at 04:09
  • @MatthewDowle: sorry, typo, it is late. I was asking `<-.data.frame`, obviously. I don't know the R internals much, but I don't think copies are made unless the object is modified, so just dispatching _should_ not make a copy. Re the `data.table` syntax, I am not saying it is difficult, it is just different. – Gabor Csardi Sep 12 '12 at 04:48
  • but to be honest, these methods spatter brackets instead of dollar signs ;-) – PatrickT Jun 21 '16 at 08:41
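The copying debated above can be observed with base R's `tracemem()`, which prints a message whenever the traced object is duplicated (a small sketch; it needs an R build with memory profiling enabled, the default for CRAN binaries, and the printed addresses vary by session):

```r
df <- data.frame(a = 1:5, b = 6:10)
tracemem(df)          # start tracing duplications of df
df$c <- df$a * df$b   # a tracemem[...] line shows `$<-` copying the frame
untracemem(df)
print(df$c)
#> [1]  6 14 24 36 50
```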

You can simply create a function to handle all sorts of multiplications like these:

GetMultiplication <- function(x, y) {
  x * y
}

# for example:
xCol <- c(1, 2, 3, 4, 5)
yCol <- c(10, 20, 30, 40, 50)
const <- 0.055

# Case 1: Column 1 * Column 2
ZCol_1 <- GetMultiplication(xCol, yCol)
print(ZCol_1)
#> [1]  10  40  90 160 250

# Case 2: Column 1 * (Column 1 * 10 + 1000)
ZCol_2 <- GetMultiplication(xCol, xCol * 10 + 1000)
print(ZCol_2)
#> [1] 1010 2040 3090 4160 5250

# Case 3: Column 1 * a constant value
ZCol_3 <- GetMultiplication(xCol, const)
print(ZCol_3)
#> [1] 0.055 0.110 0.165 0.220 0.275
Ali Safari

This works with two or more numeric columns in a data frame, though note that it multiplies across every column of `df` (and `apply()` first coerces the data frame to a matrix):

 df$product <- apply(df, 1, prod)
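If only some of the columns should enter the product, one alternative sketch is `Reduce()` over just those columns, which stays vectorized and avoids `apply()`'s matrix coercion (column names here are assumed for illustration):

```r
df <- data.frame(column1 = 1:3, column2 = c(10, 20, 30), label = c("a", "b", "c"))
# Multiply only the numeric columns; the character `label` column
# would otherwise break apply(df, 1, prod)
df$product <- Reduce(`*`, df[c("column1", "column2")])
print(df$product)
#> [1] 10 40 90
```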
Stefano Verugi