I have a large dataframe (10,000,000+ rows) that I would like to process. I'm also fairly new to R, and want to better understand how to work with large datasets like this.

I have a formula that I want to apply to each row in the dataframe. But I've found from experience that "for loops" and "apply" don't work all that well with really large datasets. I've been trying to wrap my head around Split-Apply-Combine, but I can't quite follow how to use it when I want to apply a function row-by-row.

Here's an example dataframe that has 1,000,000 rows. I'd like to apply a function that takes each row and multiplies two columns to give an output (I realize there are much easier ways to do this, but I want to practice Split-Apply-Combine).

#make a dataframe
df <- data.frame("a"=c(rep("group1",times=500000),rep("group2",times=500000)),
                 "b"=c(1:1000000),"c"=c(1000001:2000000))

What I want to do: for each row, take the value in column "b" and multiply it by the value in column "c".

user20650
Andrew
  • If you're doing pairwise multiplication there's no reason to split/group-by. I recognize that you said you're looking to practice, but the most important part of programming efficiently in R is just using vectorized operations when possible: `df$b*df$c` – Michael Feb 12 '20 at 19:38
  • If you'd like to learn more about efficient group-by operations, I'd take a look at the `data.table` package. Specifically, look at the `data.table` GForce optimization, which allows a small number of mathematical group-by operations like sum, mean, etc. to be calculated very efficiently. – Michael Feb 12 '20 at 19:38
  • What's the point of the "split" here? – iod Feb 12 '20 at 19:45
  • What if I have a more-complicated dataframe to work with? For example, let's say I have a dataframe with 2 categories (A & B), and I want to look up a value from a "lookup dataframe" using these two categories. With a small dataset, I'd simply go row-by-row (using for loops or apply), look up "category A" and "category B" in the lookup dataframe, and then insert whatever value came from the lookup dataframe. However, this wouldn't work so fast with a big dataframe.... – Andrew Feb 13 '20 at 22:19
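
For the lookup case raised in the last comment, a join is one common alternative to searching the lookup dataframe row by row. The following is only a minimal sketch using base R's `merge()`; the data and the column names `catA`, `catB`, and `value` are made up for illustration and do not come from the question.

```r
# Hypothetical data: the column names catA, catB and value are assumptions
main <- data.frame(catA = c("x", "y", "x", "y"),
                   catB = c("p", "p", "q", "q"),
                   stringsAsFactors = FALSE)
lookup <- data.frame(catA  = c("x", "x", "y", "y"),
                     catB  = c("p", "q", "p", "q"),
                     value = c(10, 20, 30, 40),
                     stringsAsFactors = FALSE)

# One join on both category columns replaces the row-by-row lookup.
# all.x = TRUE keeps rows of `main` that have no match (value becomes NA).
# Note that merge() may return the rows in a different order than `main`.
result <- merge(main, lookup, by = c("catA", "catB"), all.x = TRUE)
```

Each row of `result` carries the `value` that matches its pair of categories, with no explicit loop over rows.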

1 Answer

You don't need apply or any other row-wise function here: arithmetic operators in R are vectorized, so they work on whole columns at once. For a small example:

df <- data.frame("a"=c(rep("group1",times=5),rep("group2",times=5)),
                 "b"=c(1:10),"c"=c(11:20))
df
       a  b  c
1  group1  1 11
2  group1  2 12
3  group1  3 13
4  group1  4 14
5  group1  5 15
6  group2  6 16
7  group2  7 17
8  group2  8 18
9  group2  9 19
10 group2 10 20

I can simply do this:

df$d <- df$b * df$c  # create a new column called d
df
       a  b  c   d
1  group1  1 11  11
2  group1  2 12  24
3  group1  3 13  39
4  group1  4 14  56
5  group1  5 15  75
6  group2  6 16  96
7  group2  7 17 119
8  group2  8 18 144
9  group2  9 19 171
10 group2 10 20 200
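
If the goal is still to practice Split-Apply-Combine, the same result can be reached that way too. This is just a sketch in base R on the small data frame, not a recommendation over the one-liner. It also wraps `b` in `as.numeric()`, because `b` and `c` are integer columns and integer multiplication in R overflows to `NA` (with a warning) once a product exceeds `.Machine$integer.max`, which does happen with the values in the 1,000,000-row example from the question.

```r
# Same small data frame as in the answer above
df <- data.frame(a = rep(c("group1", "group2"), each = 5),
                 b = 1:10, c = 11:20)

# Split: one data frame per value of `a`
pieces <- split(df, df$a)

# Apply: the work inside each piece is still a vectorized multiplication.
# as.numeric() converts to double so large products do not overflow to NA.
pieces <- lapply(pieces, function(piece) {
  piece$d <- as.numeric(piece$b) * piece$c
  piece
})

# Combine: stack the pieces back into a single data frame
sac <- do.call(rbind, pieces)
rownames(sac) <- NULL

# The direct vectorized version gives the same values without the split
direct <- as.numeric(df$b) * df$c
```

Here `sac$d` and `direct` hold the same numbers; the split only adds overhead when the operation does not depend on the group.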
Will