
I am trying to group my data by Year and CountyID and then run splinefun (cubic spline interpolation) on each subset. I am open to ideas, but splinefun is a must and cannot be changed.

Here is the code I am trying to use:

age <- seq(from = 0, by = 5, length.out = 18)

TOT_POP <- df %.%
group_by(unique(df$Year), unique(df$CountyID)) %.%
splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")

Here is a sample of my data. Year runs from 2010 to 2013, Agegrp from 1 to 17, and CountyID covers every county in the US.

CountyID    Year        Agegrp      TOT_POP
1001        2010        1           3586
1001        2010        2           3952
1001        2010        3           4282
1001        2010        4           4136
1001        2010        5           3154

What I am doing is taking Agegrp 1 : 17 and splitting the groups into individual years of age, 0 - 84. Right now each group represents 5 years of age. splinefun lets me do this with a level of mathematical rigour, i.e., it lets me provide a population total for each single year of age in each county in the US.

Lastly, the splinefun code works by itself, but inside group_by it does not. It produces:

Error: wrong result size(4), expected 68 or 1. 

On its own, the splinefun code I am using works like this:

TOT_POP <- splinefun(age, c(0, cumsum(df$TOT_POP)), 
           method  = "hyman") 

TOT_POP = pmax(0, diff(TOT_POP(c(0:85))))

This was tested on one CountyID for one Year. I need to repeat the process for every year in the data and roughly 3200 counties.
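
For illustration, the single-group step can be sketched on made-up numbers. The first five values of `pop` are the sample rows above; the remaining twelve are hypothetical and only there so that all 17 age groups are present. The names `pop`, `cum_fun`, and `single_year` are illustrative as well.

```r
# Knots at the age-group boundaries: 0, 5, 10, ..., 85 (18 points)
age <- seq(from = 0, by = 5, length.out = 18)

# One county in one year: 17 five-year totals.
# First five values come from the sample; the rest are hypothetical.
pop <- c(3586, 3952, 4282, 4136, 3154, 3300, 3400, 3500, 3600,
         3700, 3500, 3200, 2800, 2300, 1800, 1300, 900)

# Monotone cubic spline through the cumulative population at each boundary
cum_fun <- splinefun(age, c(0, cumsum(pop)), method = "hyman")

# Differences of the cumulative curve at ages 0, 1, ..., 85 give one
# value per single year of age 0..84; pmax() clips any rounding noise below 0
single_year <- pmax(0, diff(cum_fun(0:85)))

length(single_year)                        # number of single-year ages
all.equal(sum(single_year), sum(pop))      # is the overall total preserved?
all.equal(sum(single_year[1:5]), pop[1])   # is the first 5-year group preserved?
```

Because the spline interpolates the cumulative totals exactly at every boundary, the single-year values inside each five-year group should add back up to that group's original total.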

joran
j riot
  • Let me get this straight. You're hoping to split the data frame according to two variables. Then, for each of the smaller data frames, you want to use `splinefun` to get a spline function mapping `age` to `TOT_POP`? You then want to use that function to interpolate the total population at all ages between 0 and 85, since your original data only had the populations for ages 5, 10, 15, 20...? Perhaps you could implement this with `split` and `lapply` or `plyr` and get something working, and then someone would be better prepared to help you with `dplyr`. – kdauria Jul 08 '14 at 21:11
  • The splinefun will be used on a subset of data that has Agegrp 1 : 17 for one county and one year at a time. The final age groups will be individual years 0 : 84. – j riot Jul 08 '14 at 21:22
  • I'm still confused. Perhaps you can change the `df` data frame? Replace `Agegrp` with the associated `age`. For instance, `df$Agegrp = df$Agegrp*5`. `colnames(df)[3] = "age"`. This may simplify your question a bit. – kdauria Jul 09 '14 at 02:35

2 Answers

# Reproducible data set
set.seed(22)
df = data.frame( CountyID = rep(1001:1005,each = 100), 
                 Year = rep(2001:2010, each = 10),
                 Agegrp = sample(1:17, 500, replace=TRUE),
                 TOT_POP = rnorm(500, 10000, 2000))

# Convert Agegrp to age
df$Agegrp = df$Agegrp*5
colnames(df)[3] = "age"

# Make a spline function for every CountyID-Year combination
split.dfs = split(df, interaction(df$CountyID, df$Year))
spline.funs = lapply(split.dfs, function(x) splinefun(x[,"age"], x[,"TOT_POP"]))

# Use the spline functions to interpolate populations for all years between 0 and 85
new.split.dfs = list()
for (i in seq_along(split.dfs)) {
  new.split.dfs[[i]] = data.frame( CountyID=split.dfs[[i]]$CountyID[1],
                                   Year=split.dfs[[i]]$Year[1],
                                   age=0:85,
                                   TOT_POP=spline.funs[[i]](0:85))
}


# Does this do what you want? If so, then it will be 
# easier for others to work from here
# > head(new.split.dfs[[1]])
# CountyID Year age  TOT_POP
# 1     1001 2001   0 909033.4
# 2     1001 2001   1 833999.8
# 3     1001 2001   2 763181.8
# 4     1001 2001   3 696460.2
# 5     1001 2001   4 633716.0
# 6     1001 2001   5 574829.9
# > tail(new.split.dfs[[2]])
# CountyID Year age   TOT_POP
# 81     1002 2001  80 10201.693
# 82     1002 2001  81  9529.030
# 83     1002 2001  82  8768.306
# 84     1002 2001  83  7916.070
# 85     1002 2001  84  6968.874
# 86     1002 2001  85  5923.268
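
If the goal is the cumulative "hyman" approach from the question rather than a spline through the raw totals, the same split/lapply pattern might be adapted roughly as follows. This is only a sketch in base R: the data set is hypothetical (two counties, two years, all 17 age groups present exactly once per group), and `interpolate_group`, `groups`, and `result` are names made up for the example.

```r
# Hypothetical data: 2 counties x 2 years x 17 five-year age groups
set.seed(1)
df <- expand.grid(Agegrp = 1:17, Year = 2010:2011, CountyID = c(1001, 1003))
df$TOT_POP <- round(runif(nrow(df), 500, 5000))

# Age-group boundaries: 0, 5, ..., 85
age <- seq(from = 0, by = 5, length.out = 18)

# Turn one CountyID-Year subset into 85 single-year rows.
# Assumes the subset holds exactly one row per Agegrp 1..17.
interpolate_group <- function(g) {
  g <- g[order(g$Agegrp), ]
  f <- splinefun(age, c(0, cumsum(g$TOT_POP)), method = "hyman")
  data.frame(CountyID = g$CountyID[1],
             Year     = g$Year[1],
             Age      = 0:84,
             TOT_POP  = pmax(0, diff(f(0:85))))
}

# Split by both keys, interpolate each piece, and stack the results
groups <- split(df, list(df$CountyID, df$Year), drop = TRUE)
result <- do.call(rbind, lapply(groups, interpolate_group))
rownames(result) <- NULL

head(result)
```

Stacking with `do.call(rbind, ...)` returns one data frame instead of a list, which is usually easier to work with downstream.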
kdauria

First, my apologies: I believe I worded what I was trying to achieve badly, and group_by was not going to solve the issue. However, I was able to solve the problem using two functions and ddply. Here is the code that solved it:

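library(plyr)  # provides colwise() and ddply()

# Interpolate one column of 17 five-year totals into 85 single-year values
interpolate <- function(x, ageVector) {
  result <- splinefun(ageVector, c(0, cumsum(x)), method = "hyman")
  diff(result(0:85))
}

# Process one CountyID-Year subset
mainFunc <- function(df) {
  age <- seq(from = 0, by = 5, length.out = 18)
  # Every column except the identifiers gets interpolated
  colNames <- setdiff(colnames(df), c("Year", "CountyID", "Agegrp"))
  # `true` is plyr's helper that selects all columns;
  # drop = FALSE keeps a data frame even when only one column remains
  colWiseSpline <- colwise(interpolate, .cols = true,
                           age)(df[, colNames, drop = FALSE])

  cbind(
    data.frame(Year = df$Year[1], County = df$CountyID[1], Agegrp = 0:84),
    colWiseSpline
  )
}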

CompleteMainRaw <- ddply(.data = df, 
                    .variables = .(CountyID, Year), 
                    .fun = mainFunc)

The code now takes each county by year and runs splinefun on that subset of the population data. It also builds a data.frame with the results, i.e., it splits the 17 five-year age groups into 85 single years of age (0 to 84) while apportioning the population appropriately, which is what splinefun does.
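
For anyone who still wants the group_by route, a rough sketch with a newer dplyr might look like the one below. It assumes dplyr 1.1.0 or later is installed (for `reframe()`), uses `%>%` in place of the old `%.%` operator, and runs on a hypothetical data set with all 17 age groups present for every county and year. I have not run this on the full county data.

```r
library(dplyr)  # assumption: version 1.1.0 or later, for reframe()

# Hypothetical data: 2 counties x 2 years x 17 five-year age groups
set.seed(1)
df <- expand.grid(Agegrp = 1:17, Year = 2010:2011, CountyID = c(1001, 1003))
df$TOT_POP <- round(runif(nrow(df), 500, 5000))

# Age-group boundaries: 0, 5, ..., 85
age <- seq(from = 0, by = 5, length.out = 18)

# Same interpolation step as above: 17 five-year totals in, 85 values out
interpolate <- function(x, ageVector) {
  result <- splinefun(ageVector, c(0, cumsum(x)), method = "hyman")
  diff(result(0:85))
}

result <- df %>%
  arrange(CountyID, Year, Agegrp) %>%   # age groups must be in order
  group_by(CountyID, Year) %>%
  reframe(Age = 0:84,
          TOT_POP = interpolate(TOT_POP, age))

head(result)
```

`reframe()` is used instead of `summarise()` because each group returns 85 rows rather than one.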

Thanks!

j riot