How to create a calculated variable in R where I sum based on criteria

Question

I have a data frame that looks like this:

query        page           clicks
apples       /apples        50
oranges      /oranges       25
apples       /bad-apples    40
bananas      /bananas       25
apples       /all-fruits    10

I need this, where a new column, querySum, is created that sums up all of the clicks for a particular query. So, in the example above, the row for apples should always show 100 (50+40+10).

query        page           clicks   querySum
apples       /apples        50       100
oranges      /oranges       25       25
apples       /bad-apples    40       100
bananas      /bananas       25       25
apples       /all-fruits    10       100

I tried dplyr, but it created a smaller table with just unique values. Is there a way to apply this to every row in a new column, perhaps using a loop?

df$querySum <- df %>% group_by(query) %>% summarise(querySum = sum(clicks))

In data.table, you can do: `df[, querySum := sum(clicks), query]` — YOLO, Apr 30 '18 at 03:06
Also, `dplyr` operations generate data frames, so you don't want to assign the results to a column. — neilfws, Apr 30 '18 at 03:06

score 1 · Accepted Answer · answered Apr 30 '18 at 03:12

Try the following:

library(tidyverse);
df %>% group_by(query) %>% mutate(querySum = sum(clicks))
## A tibble: 5 x 4
## Groups:   query [3]
#  query   page        clicks querySum
#  <fct>   <fct>        <int>    <int>
#1 apples  /apples         50      100
#2 oranges /oranges        25       25
#3 apples  /bad-apples     40      100
#4 bananas /bananas        25       25
#5 apples  /all-fruits     10      100

Or in base R you can use ave:

df$querySum = ave(df$clicks, list(df$query), FUN = sum);
df;
#    query        page clicks querySum
#1  apples     /apples     50      100
#2 oranges    /oranges     25       25
#3  apples /bad-apples     40      100
#4 bananas    /bananas     25       25
#5  apples /all-fruits     10      100

Sample data

df <- read.table(text =
    "query        page           clicks
apples       /apples        50
oranges      /oranges       25
apples       /bad-apples    40
bananas      /bananas       25
apples       /all-fruits    10", header = T)

I would use `ungroup()` at the end, i.e.: `df %>% group_by(query) %>% mutate(querySum = sum(clicks)) %>% ungroup()` to revert back to a non-grouped df. — byouness, Apr 30 '18 at 20:02
@uness Sure. Depending on further data manipulation steps, `ungroup`ing may be useful or even necessary. In OPs case, `ungroup` makes no difference, and it becomes more of a conceptual & code design question; i.e. should `group_by` always be followed up with `ungroup`. In cases such as this, I opt for conciseness. — Maurits Evers, Apr 30 '18 at 21:56

How to create a calculated variable in R where I sum based on criteria

1 Answers1

Sample data