3

The code posted at the bottom does a nice job of filling in a dataframe, using package tidyr, so that all ID's end up with the same number of periods, in the case of period defined as number of months ("Period_1" in the below code). Base dataframe testDF has ID of 1 with 5 periods, and ID of 50 and 60 with only 3 periods each. The tidyr code creates additional periods ("Period_1") for ID of 50 and 60 so they too have 5 Period_1´s. The code copies down the "Bal" and "State" fields so that all ID end up with the same number of Period_1, which is correct.

However, how would I extend the calendar month expression of "Period_2" in the same manner, as illustrated immediately below?

enter image description here

Code:

library(tidyr)

testDF <-
  data.frame(
    ID = as.numeric(c(rep(1,5),rep(50,3),rep(60,3))),
    Period_1 = as.numeric(c(1:5,1:3,1:3)),
    Period_2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-01","2012-02","2012-03"),
    Bal = as.numeric(c(rep(10,5),21:23,36:34)),
    State = c("XX","AA","BB","CC","XX","AA","BB","CC","SS","XX","AA")
  )

testDFextend <-
  testDF %>%
  tidyr::complete(ID, nesting(Period_1)) %>%
  tidyr::fill(Bal, State, .direction = "down")

testDFextend

Edit: rolling from one year to the next

A better OP example would have Period 2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12"), providing an example whereby extending Period_2 causes a rollover to the next year. Below I add to the tidyr/dplyr answer below to correctly roll over the year:

library(tidyr)
library(dplyr)

testDF <-
  data.frame(
    ID = as.numeric(c(rep(1,5),rep(50,3),rep(60,3))),
    Period_1 = as.numeric(c(1:5,1:3,1:3)),
    Period_2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12"),
    Bal = as.numeric(c(rep(10,5),21:23,36:34)),
    State = c("XX","AA","BB","CC","XX","AA","BB","CC","SS","XX","AA")
  )

testDFextend <-
  testDF %>%
  tidyr::complete(ID, nesting(Period_1)) %>%
  tidyr::fill(Bal, State, .direction = "down")

testDFextend %>% 
  separate(Period_2, into = c("year", "month"), convert = TRUE) %>% 
  fill(year) %>% 
  group_by(ID) %>% 
  mutate(month = sprintf("%02d", zoo::na.spline(month))) %>% 
  unite("Period_2", year, month, sep = "-") %>% 
# Now I add the below lines:
  separate(Period_2, into = c("year", "month"), convert = TRUE) %>% 
  mutate(month = as.integer(sprintf("%02d", zoo::na.spline(month)))) %>%
  mutate(year1 = ifelse(month > 12, year+trunc(month/12), year)) %>%
  mutate(month1 = ifelse(month > 12 & month%%12!= 0, month%%12, month)) %>%
  mutate(month1 = ifelse(month1 < 10, paste0(0,month1),month1)) %>%
  unite("Period_2", year1, month1, sep = "-") %>%
  select("ID","Period_1","Period_2","Bal","State")
Village.Idyot
  • 1,359
  • 2
  • 8

4 Answers4

2

by ID you can strsplit the date, and take the elements to create a new data.frame to merge with.

ml <- max(with(testDF, tapply(ID, ID, length)))  ## get max. period length

by(testDF, testDF$ID, \(x) {
  sp <- strsplit(x$Period_2, '-')
  s <- as.numeric(sp[[1]][[2]])
  if (ml != nrow(x))
  merge(x, data.frame(Period_2=paste0(sp[[1]][[1]], '-', sprintf('%02d', (s + nrow(x)):(s + ml - 1))),
                      Period_1=(nrow(x) + 1):ml,
                      ID=x$ID[nrow(x)], Bal=x$Bal[nrow(x)], State=x$State[nrow(x)]), all=TRUE)
  else x
}) |> c(make.row.names=FALSE) |> do.call(what=rbind)
#    ID Period_1 Period_2 Bal State
# 1   1        1  2012-06  10    XX
# 2   1        2  2012-07  10    AA
# 3   1        3  2012-08  10    BB
# 4   1        4  2012-09  10    CC
# 5   1        5  2012-10  10    XX
# 6  50        1  2013-06  21    AA
# 7  50        2  2013-07  22    BB
# 8  50        3  2013-08  23    CC
# 9  50        4  2013-09  23    CC
# 10 50        5  2013-10  23    CC
# 11 60        1  2012-01  36    SS
# 12 60        2  2012-02  35    XX
# 13 60        3  2012-03  34    AA
# 14 60        4  2012-04  34    AA
# 15 60        5  2012-05  34    AA

Edit

For older R versions (although it's recommended to always use update software), do:

do.call(c(by(testDF, testDF$ID, function(x) {
  sp <- strsplit(x$Period_2, '-')
  s <- as.numeric(sp[[1]][[2]])
  if (ml != nrow(x))
    merge(x, data.frame(Period_2=paste0(sp[[1]][[1]], '-', sprintf('%02d', (s + nrow(x)):(s + ml - 1))),
                        Period_1=(nrow(x) + 1):ml,
                        ID=x$ID[nrow(x)], Bal=x$Bal[nrow(x)], State=x$State[nrow(x)]), all=TRUE)
  else x
}), make.row.names=FALSE), what=rbind)
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • Hi jay.sf. I can't get this to run quite yet. I'm not familiar with the \ in `\(x)` in the first line, nor the `|>` throughout. I think these are new symbols employed in the newest version of R. Do you know what these are the equivalents of in pre-newest-R? Also, is this all done in base R? – Village.Idyot Nov 21 '22 at 09:37
  • @Village.Idyot Why not updating R? Please see edit. – jay.sf Nov 21 '22 at 09:56
  • Yes, I need to update R. But I'm afraid if I do a lot of the code I've written in old R won't work with the new R and then I'll end up spending countless hours debugging so it works with new R. Maybe my fears are unfounded. – Village.Idyot Nov 21 '22 at 10:01
  • @Village.Idyot I definitely would give it a shot, the R-core team is very conservative and highly emphasizes backward-compatibility (in contrast to e.g. the tidyverse folks). You can always install an older R version. – jay.sf Nov 21 '22 at 10:07
2

A tidyverse solution based on zoo::na.spline. Note that it does not handle year changes. It's harder than I thought, especially because zoo::na.spline does not seem to work on yearmon format.

library(tidyr)
library(dplyr)
testDFextend %>% 
  separate(Period_2, into = c("year", "month"), convert = TRUE) %>% 
  fill(year) %>% 
  group_by(ID) %>% 
  mutate(month = sprintf("%02d", zoo::na.spline(month))) %>% 
  unite("Period_2", year, month, sep = "-")

output

      ID Period_1 Period_2   Bal State
   <dbl>    <dbl> <chr>    <dbl> <chr>
 1     1        1 2012-06     10 XX   
 2     1        2 2012-07     10 AA   
 3     1        3 2012-08     10 BB   
 4     1        4 2012-09     10 CC   
 5     1        5 2012-10     10 XX   
 6    50        1 2013-06     21 AA   
 7    50        2 2013-07     22 BB   
 8    50        3 2013-08     23 CC   
 9    50        4 2013-09     23 CC   
10    50        5 2013-10     23 CC   
11    60        1 2012-01     36 SS   
12    60        2 2012-02     35 XX   
13    60        3 2012-03     34 AA   
14    60        4 2012-04     34 AA   
15    60        5 2012-05     34 AA 
Maël
  • 45,206
  • 3
  • 29
  • 67
  • I think you can handle year changes by grouping, using `cur_data()` and then making a length 5 `seq.Date`, starting with the first element. – diomedesdata Nov 21 '22 at 10:15
  • Please see the edit to my OP, a proposed change to this answer to accommodate the situation where Period_2 had been set as defined Period 2 as ´c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12")´, a situation where the year rolls over to the next year during an extension (as in ID = 60) – Village.Idyot Nov 21 '22 at 18:08
2

For each ID convert Period_2 to yearmon class. This represents year and month without day. Internally it uses year + fraction where fraction = 0, 1/12, ..., 11/12 for the 12 months. Expand it out using seq. Then convert it back to character or omit the format line to keep the result as a yearmon object.

library(dplyr, exclude = c("filter", "lag"))
library(zoo)

testDFextend %>%
  group_by(ID) %>%
  mutate(Period_2 = as.yearmon(first(Period_2)) + seq(0, by=1/12, length=n())) %>%
  mutate(Period_2 = format(Period_2, "%Y-%m")) %>%
  ungroup

giving:

# A tibble: 15 × 5
      ID Period_1 Period_2   Bal State
   <dbl>    <dbl> <chr>    <dbl> <chr>
 1     1        1 2012-06     10 XX   
 2     1        2 2012-07     10 AA   
 3     1        3 2012-08     10 BB   
 4     1        4 2012-09     10 CC   
 5     1        5 2012-10     10 XX   
 6    50        1 2013-06     21 AA   
 7    50        2 2013-07     22 BB   
 8    50        3 2013-08     23 CC   
 9    50        4 2013-09     23 CC   
10    50        5 2013-10     23 CC   
11    60        1 2012-01     36 SS   
12    60        2 2012-02     35 XX   
13    60        3 2012-03     34 AA   
14    60        4 2012-04     34 AA   
15    60        5 2012-05     34 AA   
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Hi, yes that works very nicely for keeping the Period_2 calendar months correct. My example OP is suboptimal, I should have better defined Period 2 as `c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12")` in which case some of the above answers extend the ID of 60 for example to 2012-13 and 2012-14 for extended periods 4 and 5. Your solution on the other hand produces the correct 2013-01 and 2013-02 instead. – Village.Idyot Nov 21 '22 at 16:35
  • However, I have never before seen the use of an "exclude" as in `library(dplyr, exclude = c("filter", "lag"))`. This code is being used in an App that uses dplyr throughout. Will this exclusion conflict with other places where dplyr is used? – Village.Idyot Nov 21 '22 at 16:38
  • 1
    dplyr's `lag` and `filter` conflict with base R functions of the same name and potentially can cause errors when used with other packages. I routinely use this to prevent these mysterious errors. You can still use `dplyr::lag` and `dplyr::filter` if you need to access these dplyr functions. – G. Grothendieck Nov 21 '22 at 16:41
1

I think the nicest way to do this is to make use of the padr package, which is built to pad data.frames where there are missing/incomplete columns.

This uses grouping and cur_data() to make the correct date sequence in Period_2.

library(dplyr)
library(tidyr)
library(padr)

n_periods <- 5

testDF %>%
  pad_int(end_val = n_periods , by = "Period_1", group = "ID") %>%
  group_by(ID) %>%
  mutate(Period_2 = as.Date(paste0(Period_2, "-01"))) %>%
  mutate(Period_2 = seq(cur_data()$Period_2[1], by = "months", length.out = 
    n_periods) %>% format("%Y-%m")) %>%
  fill(Bal, State) %>%
  ungroup() %>%
  select(ID, Period_1, Period_2, Bal, State)    
      ID Period_1 Period_2   Bal State
   <dbl>    <dbl> <chr>    <dbl> <chr>
 1     1        1 2012-06     10 XX   
 2     1        2 2012-07     10 AA   
 3     1        3 2012-08     10 BB   
 4     1        4 2012-09     10 CC   
 5     1        5 2012-10     10 XX   
 6    50        1 2013-06     21 AA   
 7    50        2 2013-07     22 BB   
 8    50        3 2013-08     23 CC   
 9    50        4 2013-09     23 CC   
10    50        5 2013-10     23 CC   
11    60        1 2012-01     36 SS   
12    60        2 2012-02     35 XX   
13    60        3 2012-03     34 AA   
14    60        4 2012-04     34 AA   
15    60        5 2012-05     34 AA 

Note that this will handle cases when the year rolls over to the next year during Period_2.

Finally, you could adjust n_periods if you needed a different number of periods (or use a function to figure it out automatically, like jay.sf's answer).

diomedesdata
  • 995
  • 1
  • 6
  • 15
  • Very fast when run against large data and like you point out, it does handle cases when rolling `Period_2` from one year to the next. – Village.Idyot Nov 22 '22 at 13:20