3

I have a dataframe test_case. I have missing data in a column (income).

test_case <- data.frame(
person=c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
year=c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2013, 2014, 2014, 2014),
income=c(4, 10, 13, NA, NA, NA, 13, NA, NA, NA, NA, NA),
cutoff=c(0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 0, 0)
)

The variable cutoff specifies the number of times that I would like to 'carry forward' the values in income into subsequent rows (using the na.locf() method in the package zoo). For example, in the dataframe above, the value for 2 in cutoff indicates that income should be carried forward twice.

I have seen examples on SO about specifying how to use na.locf to carry forward n times when n is constant. But in my case, I am having trouble generalizing (R -- Carry last observation forward n times) when n is changing.

Here is my original dataframe:

   person year income cutoff
1       1 2010      4      0
2       1 2011     10      0
3       1 2012     13      2
4       2 2010     NA      0
5       2 2011     NA      0
6       2 2012     NA      0
7       3 2010     13      3
8       3 2011     NA      0
9       3 2013     NA      0
10      3 2014     NA      0
11      3 2014     NA      0
12      3 2014     NA      0

And here is the desired output:

   person year income cutoff
1       1 2010      4      0
2       1 2011     10      0
3       1 2012     13      2
4       2 2010     13      0
5       2 2011     13      0
6       2 2012     NA      0
7       3 2010     13      3
8       3 2011     13      0
9       3 2013     13      0
10      3 2014     13      0
11      3 2014     NA      0
12      3 2014     NA      0
Community
  • 1
  • 1
user3795577
  • 187
  • 1
  • 18
  • 1
    If person 1 and person 2 are different people, why would you want to carry forward person 1's income in 2012 to person 2's income in 2010? – David Robinson Dec 07 '15 at 22:16

3 Answers3

5

Here's an attempt using data.table. The grouping method is at @jeremys answer, though I'm avoiding ifelse or lapply here, rather combining the first income value replicated according to first income value with NAs values replicate .N - (cutoff[1L] + 1L) times. I'm also operating only on the values since first time cutoff > 0L)

library(data.table)
setDT(test_case)[which.max(cutoff > 0L):.N, # Or `cutoff > 0L | is.na(income)`
                 income := c(rep(income[1L], cutoff[1L] + 1L), rep(NA, .N - (cutoff[1L] + 1L))), 
                 by = cumsum(cutoff != 0L)]
test_case
#     person year income cutoff
#  1:      1 2010      4      0
#  2:      1 2011     10      0
#  3:      1 2012     13      2
#  4:      2 2010     13      0
#  5:      2 2011     13      0
#  6:      2 2012     NA      0
#  7:      3 2010     13      3
#  8:      3 2011     13      0
#  9:      3 2013     13      0
# 10:      3 2014     13      0
# 11:      3 2014     NA      0
# 12:      3 2014     NA      0
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
3

Here's an answer using dplyr.

It works by grouping by the cumulative sum of different cutoffs.

Then it makes a list of one FALSE if cutoff is 0, and cutoff number of TRUEs, which is unlisted and sliced to the size of the group.

Then using ifelse, the income is either unmodified or made to be the first income (ie the cutoff one).

library(dplyr)

test_case %>% group_by(z = cumsum(cutoff != 0)) %>%
              mutate(income = ifelse(unlist(lapply(cutoff, function(x) rep(as.logical(x), max(1,x + 1))))[1:n()], income[1], income))

Source: local data frame [12 x 5]
Groups: z [3]

       z person  year income cutoff
   (int)  (dbl) (dbl)  (dbl)  (dbl)
1      0      1  2010      4      0
2      0      1  2011     10      0
3      1      1  2012     13      2
4      1      2  2010     13      0
5      1      2  2011     13      0
6      1      2  2012     NA      0
7      2      3  2010     13      3
8      2      3  2011     13      0
9      2      3  2013     13      0
10     2      3  2014     13      0
11     2      3  2014     NA      0
12     2      3  2014     NA      0
jeremycg
  • 24,657
  • 5
  • 63
  • 74
2

A solution using na.locf could work in a similar way to @jeremycg's solution. We simply need to group by cumsum(cutoff != 0) and another variable which is the shifted row_number

My solution isn't as elegant as jeremycg's one, but this is how I approached it:

library(dplyr)
library(zoo)
test_case %>%
  mutate(
    rownum = row_number(),
    cutoff2 = ifelse(cutoff == 0, NA, cutoff + rownum),
    cutoff2 = na.locf(cutoff2, na.rm = FALSE),
    cutoff2 = ifelse(rownum > cutoff2, NA, cutoff2)
  ) %>%
  group_by(z = cumsum(cutoff != 0), cutoff2) %>%
  mutate(income = na.locf(income, na.rm = FALSE))
# Source: local data frame [12 x 7]
# Groups: z, cutoff2 [5]
# 
#    person  year income cutoff rownum cutoff2     z
#     (dbl) (dbl)  (dbl)  (dbl)  (int)   (dbl) (int)
# 1       1  2010      4      0      1      NA     0
# 2       1  2011     10      0      2      NA     0
# 3       1  2012     13      2      3       5     1
# 4       2  2010     13      0      4       5     1
# 5       2  2011     13      0      5       5     1
# 6       2  2012     NA      0      6      NA     1
# 7       3  2010     13      3      7      10     2
# 8       3  2011     13      0      8      10     2
# 9       3  2013     13      0      9      10     2
# 10      3  2014     13      0     10      10     2
# 11      3  2014     NA      0     11      NA     2
# 12      3  2014     NA      0     12      NA     2
chappers
  • 2,415
  • 14
  • 16