6

I'm developing a model that forecasts completed fertility for an age cohort. I currently have a data frame like this, where the rows are ages and the columns are years. The value in each cell is age-specific fertility for that year:

> df1
   iso3    sex age fert1953 fert1954 fert1955
14  AUS female  13    0.000  0.00000  0.00000
15  AUS female  14    0.000  0.00000  0.00000
16  AUS female  15   13.108 13.42733 13.74667
17  AUS female  16   26.216 26.85467 27.49333
18  AUS female  17   39.324 40.28200 41.24000

However, what I want is each row to be a cohort. Because the rows and columns represent individual years, the cohort data can be obtained by getting the diagonal. I'm looking for a result like this:

> df2
   iso3    sex ageIn1953 fert1953  fert1954  fert1955
14  AUS female        13    0.000   0.00000  13.74667
15  AUS female        14    0.000  13.42733  27.49333
16  AUS female        15   13.108  26.85467  41.24000
17  AUS female        16   26.216  40.28200  [data..] 
18  AUS female        17   39.324  [data..]  [data..] 

Here's the df1 data frame:

df1 <- structure(list(iso3 = c("AUS", "AUS", "AUS", "AUS", "AUS"), sex = c("female", 
"female", "female", "female", "female"), age = c(13, 14, 15, 
16, 17), fert1953 = c(0, 0, 13.108, 26.216, 39.324), fert1954 = c(0, 
0, 13.4273333333333, 26.8546666666667, 40.282), fert1955 = c(0, 
0, 13.7466666666667, 27.4933333333333, 41.24)), .Names = c("iso3", 
"sex", "age", "fert1953", "fert1954", "fert1955"), class = "data.frame", row.names = 14:18)

EDIT:

Here's the solution I ultimately used. It's based on David's answer, but I needed to do this for each level of iso3.

df.ls <- lapply(split(f3, f = f3$iso3), FUN = function(df1) {
  n <- ncol(df1) - 4
  temp <- mapply(function(x, y) lead(x, n = y), df1[, -seq_len(4)], seq_len(n))
  return(cbind(df1[seq_len(4)], temp))
})
f4 <- do.call("rbind", df.ls)
joran
  • 169,992
  • 32
  • 429
  • 468
rsoren
  • 4,036
  • 3
  • 26
  • 37

2 Answers2

4

I haven't tested for speed, but data.table v1.9.5, recently implemented a new (written in C) lead/lag function called shift

So for the columns you want to shift, you could potentially use it combined with mapply, for example

library(data.table)
n <- ncol(df1) - 4 # the number of years - 1
temp <- mapply(function(x, y) shift(x, n = y, type = "lead"), df1[, -seq_len(4)], seq_len(n))
cbind(df1[seq_len(4)], temp) # combining back with the unchanged columns
#    iso3    sex age fert1953 fert1954 fert1955
# 14  AUS female  13    0.000  0.00000 13.74667
# 15  AUS female  14    0.000 13.42733 27.49333
# 16  AUS female  15   13.108 26.85467 41.24000
# 17  AUS female  16   26.216 40.28200       NA
# 18  AUS female  17   39.324       NA       NA

Edit: You can easily install the development version of data.table from GitHub using

library(devtools) 
install_github("Rdatatable/data.table", build_vignettes = FALSE)

Either way, if you want dplyr, here goes

library(dplyr)
n <- ncol(df1) - 4 # the number of years - 1
temp <- mapply(function(x, y) lead(x, n = y), df1[, -seq_len(4)], seq_len(n))
cbind(df1[seq_len(4)], temp)
#    iso3    sex age fert1953 fert1954 fert1955
# 14  AUS female  13    0.000  0.00000 13.74667
# 15  AUS female  14    0.000 13.42733 27.49333
# 16  AUS female  15   13.108 26.85467 41.24000
# 17  AUS female  16   26.216 40.28200       NA
# 18  AUS female  17   39.324       NA       NA
Arun
  • 116,683
  • 26
  • 284
  • 387
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • This looks great. Unfortunately v1.9.4 is the latest version of data.table on CRAN, and I'm having trouble downloading the dev version from GitHub. Apparently this is a common problem on Windows. I'm working on using dplyr's ```lead()``` function instead; I think it should work... – rsoren Jan 08 '15 at 22:00
  • I made an adjustment to account for the fact that this needs to be done separately for each value of ```iso3``` (see my edit above). The issue with data.table was that I got a "Command failed (1)" error, which made it more difficult that the usual ```install_github()```. There's a link about this on data.table's installation page. Thanks much! – rsoren Jan 09 '15 at 19:12
1

Here is a base R approach:

df1[,5:ncol(df1)] <- mapply(function(x, y) {vec.list <- df1[-1:-y, x]
                       length(vec.list) <- nrow(df1)
                       vec.list},
                       x=5:ncol(df1), y=1:(ncol(df1)-4))
df1
#   iso3    sex age fert1953 fert1954 fert1955
#14  AUS female  13    0.000  0.00000 13.74667
#15  AUS female  14    0.000 13.42733 27.49333
#16  AUS female  15   13.108 26.85467 41.24000
#17  AUS female  16   26.216 40.28200       NA
#18  AUS female  17   39.324       NA       NA
Jota
  • 17,281
  • 7
  • 63
  • 93