R - group_by utilizing splinefun

Question

I am trying to group my data by Year and CountyID and then apply splinefun (cubic spline interpolation) to each subset. I am open to ideas, but splinefun is a must and cannot be changed.

Here is the code I am trying to use:

age <- seq(from = 0, by = 5, length.out = 18)

TOT_POP <- df %.%
group_by(unique(df$Year), unique(df$CountyID) %.%
splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")

Here is a sample of my data. Year = 2010 : 2013, Agegrp = 1 : 17, and CountyID covers all counties in the US.

CountyID    Year        Agegrp      TOT_POP
1001        2010        1           3586
1001        2010        2           3952
1001        2010        3           4282
1001        2010        4           4136
1001        2010        5           3154

What I am doing is taking Agegrp 1 : 17 and splitting each group into single years of age, 0 - 84. Right now each group represents 5 years. splinefun allows me to do this while providing a level of mathematical rigour to the process, i.e., it allows me to provide a population total for each year of age in each individual county in the US.

Lastly, the splinefun code works by itself, but within the group_by function it does not; it produces:

Error: wrong result size(4), expected 68 or 1.

The splinefun code, the way I am using it, works like this:

TOT_POP <- splinefun(age, c(0, cumsum(df$TOT_POP)), 
           method  = "hyman") 

TOT_POP = pmax(0, diff(TOT_POP(c(0:85))))

Which was tested on one CountyID during one Year. I need to iterate this process over "x" amount of years and roughly 3200 counties.

Let me get this straight. You're hoping to split the data frame according to two variables. Then, for each of the smaller data frames, you want to use `splinefun` to get a spline function mapping `age` to `TOT_POP`? You then want to use that function to interpolate the total population at all ages between 0 and 85, since your original data only had the populations for ages 5, 10, 15, 20...? Perhaps you could implement this with `split` and `lapply` or `plyr` and get something working, and then someone would be better prepared to help you with `dplyr`. — kdauria, Jul 08 '14 at 21:11
The splinefun will be used on a subset of data that has Agegrp 1 : 17 for one county and one year at a time. The final age groups will be individual years 0 : 84. — j riot, Jul 08 '14 at 21:22
I'm still confused. Perhaps you can change the `df` data frame? Replace `Agegrp` with the associated `age`. For instance, `df$Agegrp = df$Agegrp*5`. `colnames(df)[3] = "age"`. This may simplify your question a bit. — kdauria, Jul 09 '14 at 02:35

kdauria · Answer 1 · 2014-07-09T03:32:42.383

# Reproducible data set
set.seed(22)
df = data.frame( CountyID = rep(1001:1005,each = 100), 
                 Year = rep(2001:2010, each = 10),
                 Agegrp = sample(1:17, 500, replace=TRUE),
                 TOT_POP = rnorm(500, 10000, 2000))

# Convert Agegrp to age
df$Agegrp = df$Agegrp*5
colnames(df)[3] = "age"

# Make a spline function for every CountyID-Year combination
split.dfs = split(df, interaction(df$CountyID, df$Year))
spline.funs = lapply(split.dfs, function(x) splinefun(x[,"age"], x[,"TOT_POP"]))

# Use the spline functions to interpolate populations for all years between 0 and 85
new.split.dfs = list()
for (i in seq_along(split.dfs)) {
  new.split.dfs[[i]] = data.frame( CountyID=split.dfs[[i]]$CountyID[1],
                                   Year=split.dfs[[i]]$Year[1],
                                   age=0:85,
                                   TOT_POP=spline.funs[[i]](0:85))
}


# Does this do what you want? If so, then it will be 
# easier for others to work from here
# > head(new.split.dfs[[1]])
# CountyID Year age  TOT_POP
# 1     1001 2001   0 909033.4
# 2     1001 2001   1 833999.8
# 3     1001 2001   2 763181.8
# 4     1001 2001   3 696460.2
# 5     1001 2001   4 633716.0
# 6     1001 2001   5 574829.9
# > tail(new.split.dfs[[2]])
# CountyID Year age   TOT_POP
# 81     1002 2001  80 10201.693
# 82     1002 2001  81  9529.030
# 83     1002 2001  82  8768.306
# 84     1002 2001  83  7916.070
# 85     1002 2001  84  6968.874
# 86     1002 2001  85  5923.268
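
The split-and-lapply pattern above can also be combined with the cumulative Hyman spline from the question. What follows is a minimal sketch under stated assumptions, not a solution tested on the real data: the `toy` data frame, the helper `interpolate_group`, and the result name `single_years` are made up for illustration, and every county-year is assumed to contain exactly 17 age groups.

```r
# Hypothetical toy data: 2 counties x 2 years x 17 five-year age groups
set.seed(1)
toy <- expand.grid(Agegrp = 1:17, Year = 2010:2011, CountyID = c(1001, 1003))
toy$TOT_POP <- round(runif(nrow(toy), 1000, 5000))

# Knots at ages 0, 5, ..., 85 (one more knot than there are age groups)
age <- seq(from = 0, by = 5, length.out = 18)

# Interpolate one county-year: fit a monotone (Hyman) spline to the
# cumulative population, then difference it at single years of age
interpolate_group <- function(g) {
  g <- g[order(g$Agegrp), ]
  cum_fun <- splinefun(age, c(0, cumsum(g$TOT_POP)), method = "hyman")
  data.frame(CountyID = g$CountyID[1],
             Year     = g$Year[1],
             age      = 0:84,
             TOT_POP  = pmax(0, diff(cum_fun(0:85))))
}

# One data frame per CountyID-Year combination, then stack the results
pieces <- split(toy, interaction(toy$CountyID, toy$Year, drop = TRUE))
single_years <- do.call(rbind, lapply(pieces, interpolate_group))
rownames(single_years) <- NULL
```

Because the spline passes through the cumulative totals at every knot, the five single-year values inside each original age group should add back up to that group's total, which is an easy sanity check on the output.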

score 0 · Answer 2 · answered Jul 09 '14 at 18:30

First, I believe I was using the wrong wording in what I was trying to achieve, my apologies; group_by actually wasn't going to solve the issue. However, I was able to solve the problem using two functions and ddply. Here is the code that solved the issue:

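library(plyr)  # provides ddply(), colwise() and true()

# Interpolate one column of 5-year totals to single years of age
interpolate <- function(x, ageVector){
  result <- splinefun(ageVector, 
                      c(0, cumsum(x)), method = "hyman")
  diff(result(c(0:85)))
}

# Process one CountyID-Year subset
mainFunc <- function(df){

  age <- seq(from = 0, by = 5, length.out = 18)
  # Every column except the identifiers is interpolated
  colNames <- setdiff(colnames(df),
                      c("Year", "CountyID", "Agegrp"))
  # drop = FALSE keeps a data.frame even if only one column remains
  colWiseSpline <- colwise(interpolate, .cols = true,
                           age)(df[ , colNames, drop = FALSE])

  cbind(data.frame(
          Year = df$Year[1],
          County = df$CountyID[1],
          Agegrp = 0:84
        ),
        colWiseSpline
  )
}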

CompleteMainRaw <- ddply(.data = df, 
                    .variables = .(CountyID, Year), 
                    .fun = mainFunc)

The code now takes each county by year and runs splinefun on that subset of population data. At the same time it creates a data.frame with the results, i.e., it splits the data from 17 age groups into 85 single-year age groups while factoring it out appropriately, which is what splinefun does.

Thanks!
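
For readers who still want a dplyr version, the same per-group calculation can probably be written with `group_modify()`, which lets each group return a data frame with any number of rows. This is a sketch under assumptions rather than the method used above: it assumes dplyr 0.8.1 or later, the `toy` data and the name `single_years` are invented for illustration, and each county-year is assumed to have exactly 17 age groups.

```r
library(dplyr)  # assumes dplyr >= 0.8.1, where group_modify() is available

# Hypothetical toy data: 2 counties x 2 years x 17 five-year age groups
set.seed(2)
toy <- expand.grid(Agegrp = 1:17, Year = 2010:2011, CountyID = c(1001, 1003))
toy$TOT_POP <- round(runif(nrow(toy), 1000, 5000))

# Knots at ages 0, 5, ..., 85
age <- seq(from = 0, by = 5, length.out = 18)

single_years <- toy %>%
  arrange(CountyID, Year, Agegrp) %>%   # age groups must be in order
  group_by(CountyID, Year) %>%
  group_modify(function(g, key) {
    # g holds one county-year; key holds its CountyID and Year
    cum_fun <- splinefun(age, c(0, cumsum(g$TOT_POP)), method = "hyman")
    data.frame(age = 0:84, TOT_POP = pmax(0, diff(cum_fun(0:85))))
  }) %>%
  ungroup()
```

The grouping columns are added back to the result automatically, so `single_years` should have the columns CountyID, Year, age and TOT_POP, with 85 rows per county-year.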
