-2

A simplified version of my data looks like:

   year   title  name 
   2019     x     a
   2019     y     b
   2018     x     a
   2018     y     a
   2017     x     c
   2017     y     a
   2016     x     a
   2016     y     b

I would like to create a new dataframe that would look like this:

   name   title  year   runtot 
   a        x    2016     1
   a        x    2017     1
   a        x    2018     2
   a        x    2019     3
   a        y    2016     0
   a        y    2017     1
   a        y    2018     2
   a        y    2019     2
   a        xy   2016     1
   a        xy   2017     2
   a        xy   2018     4
   a        xy   2019     5
   b        x    2016     0
   b        x    2017     0
   b        x    2018     0
   b        x    2019     0
   b        y    2016     1
   b        y    2017     1
   b        y    2018     1
   b        y    2019     2
   b        xy   2016     1
   b        xy   2017     1
   b        xy   2018     1
   b        xy   2019     2
   c ...

Regarding runtot, the new column I want to create, I would like to do the running total for each of the values in the name column. In other words, I would like to know how the running total grows separately for a, how it grows for b, how it grows for c, etc.

I have tried subsetting the data but I could not get an approximate result of what I want.

Any ideas or suggestions?

console.log
  • 177
  • 2
  • 16
  • Your desired output doesn't seem to match your example data – heds1 Aug 20 '19 at 02:43
  • I think I figured it out... grouped by name, OP wants the occurrence of each title to appear in all subsequent years, with duplicates incrementing the cumulative sum. Name `a` has an `x` in 2016, 2018, and 2019 in the input, so the output has `1` values for `name a title x` in 2016 and 2017 (the `1` carries forward), and then cumsum becomes `2` in 2018 and 3 and 2019, as the other `name a title x` values join in. There's an extra layer of complication in that OP wants titles `x` and `y` counted both individually *and* together... which introduces some confusion (see next comment) – Gregor Thomas Aug 20 '19 at 02:57
  • If there were 3 titles, `x, y, z`, would you want all combinations, `x`, `y`, `z`, `xy`, `xz`, `yz`, `xyz`, or just the individuals and everything, `x`, `y`, `z`, `xyz`, or something else? Bearing in mind that if the answer is all combinations the size could get out of hand pretty quickly... – Gregor Thomas Aug 20 '19 at 02:59
  • That's it. Just the individuals and everything. I am only interested in `x`, `y`, `z`, and `xyz`. Thanks! – console.log Aug 21 '19 at 02:02

2 Answers2

0

Here's a way to get the individual values. I'll leave the combinations for a more ambitious answerer (and to wait on clarification from OP):

library(dplyr)
input %>% mutate(value = 1) %>%
  tidyr::complete(year, title, name) %>%
  arrange(name, title, year) %>%
  group_by(name, title) %>%
  mutate(runtot = cumsum(coalesce(value, 0))) %>%
  filter(runtot > 0) %>%
  select(-value)
# # A tibble: 14 x 4
# # Groups:   name, title [4]
#     year title name  runtot
#    <int> <fct> <fct>  <dbl>
#  1  2016 x     a          1
#  2  2017 x     a          1
#  3  2018 x     a          2
#  4  2019 x     a          3
#  5  2017 y     a          1
#  6  2018 y     a          2
#  7  2019 y     a          2
#  8  2016 y     b          1
#  9  2017 y     b          1
# 10  2018 y     b          1
# 11  2019 y     b          2
# 12  2017 x     c          1
# 13  2018 x     c          1
# 14  2019 x     c          1

Using this data:

input = read.table(text = 'year   title  name 
   2019     x     a
   2019     y     b
   2018     x     a
   2018     y     a
   2017     x     c
   2017     y     a
   2016     x     a
   2016     y     b', header= T)
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thanks so much, Gregor. That helps a lot. The only issue with this solution is that I would like to do the running total for each of the values in the `name` column. Now the result of `runtot` is not taking into consideration the different values in `name`. In other words, I would like to know how the running total grows for a, how it grows for b, c, etc. Sorry I should have been more specific. – console.log Aug 21 '19 at 02:00
  • Whoops, left out by `group_by` line. See updates. (Still not doing the everything, just the individuals...) – Gregor Thomas Aug 21 '19 at 17:07
0
# Using the example with 'xy' in the 'title' column.
df <- data.frame('Name' = as.character(c('a','b','a','a','c','a','a','b','a','b','c','a')),
                 'Title' = as.character(rep(c('x','y','xy'),4)),
                 'Year' = sort(rep(seq(2016,2019,1),3)))

df$Name <- as.character(df$Name)
df$Title <- as.character(df$Title)
df$Year <- as.integer(df$Year)

library(dplyr)

df <- df %>% 
  arrange(Name, Title)

You didn't mention what is the runtot column. Therefore, i didn't include it.

MrCorote
  • 565
  • 8
  • 21
  • Thanks! Sorry, I should have been more specific. Regarding `runtot`, I would like to do the running total for each of the values in the `name` column. In other words, I would like to know how the running total grows for a, how it grows for b, c, etc. – console.log Aug 21 '19 at 01:59