0

I have a dataframe made up of 6 columns. Columns 1 to 5 each have discrete names/values, such as a district, year, month, age interval and gender. The sixth column is the number of death counts for that specific combination.

               District Gender Year Month Age.Group Total.Deaths
1              Eastern  Female 2003     1        -1            0
2              Eastern  Female 2003     1        -2            2
3              Eastern  Female 2003     1         0            2
4              Eastern  Female 2003     1      01-4            1
5              Eastern  Female 2003     1     05-09            0
6              Eastern  Female 2003     1     10-14            1
7              Eastern  Female 2003     1     15-19            0
8              Eastern  Female 2003     1     20-24            4
9              Eastern  Female 2003     1     25-29            9
10             Eastern  Female 2003     1     30-34            3
11             Eastern  Female 2003     1     35-39            7
12             Eastern  Female 2003     1     40-44            5
13             Eastern  Female 2003     1     45-49            5
14             Eastern  Female 2003     1     50-54            8
15             Eastern  Female 2003     1     55-59            5
16             Eastern  Female 2003     1     60-64            4
17             Eastern  Female 2003     1     65-69            7
18             Eastern  Female 2003     1     70-74            8
19             Eastern  Female 2003     1     75-79            5
20             Eastern  Female 2003     1     80-84           10
21             Eastern  Female 2003     1       85+           11
22             Eastern  Female 2003     2        -1            0
23             Eastern  Female 2003     2        -2            0
24             Eastern  Female 2003     2         0            4
25             Eastern  Female 2003     2      01-4            1
26             Eastern  Female 2003     2     05-09            2
27             Eastern  Female 2003     2     10-14            2
28             Eastern  Female 2003     2     15-19            0

I would like to filter, or extract, smaller dataframes from this big dataframe. For example, I would like to only have four age groups. These four age groups will each contain:

Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+

The Total.Deaths will then be the sum for each of these groups.

So I want it to look like this

               District Gender Year Month Age.Group Total.Deaths
1              Eastern  Female 2003     1         0            4
2              Eastern  Female 2003     1      01-4            1
3              Eastern  Female 2003     1     05-14            1
4              Eastern  Female 2003     1       15+            104
5              Eastern  Female 2003     2         0            4
6              Eastern  Female 2003     2      01-4            1
7              Eastern  Female 2003     2     05-14            4
8              Eastern  Female 2003     2       15+            ...

I have a lot of data and have searched for a few days, but unable to find a function to help be do this.

OSlOlSO
  • 441
  • 7
  • 14

1 Answers1

1

There may be a pithier way of recoding your age variable using something like recode from the car package, particularly since you've conveniently got your current age variable coded with levels that sort nicely as characters. But for only a few levels, I often just recode them by hand by creating a new age variable, and this method is good practice for just 'getting stuff done' in R:

#Reading your data in from a text file that I made via copy/paste
dat <- read.table("~/Desktop/soEx.txt",sep="",header=TRUE)

#Make sure Age.Group is ordered and init new age variable
dat$Age.Group <- factor(dat$Age.Group,ordered=TRUE)
dat$AgeGroupNew <- rep(NA,nrow(dat))

#The recoding
dat$AgeGroupNew[dat$Age.Group <= "0"] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group >= "05-09" & dat$Age.Group <= "10-14" ] <- "05-14"
dat$AgeGroupNew[dat$Age.Group > "10-14" ] <- "15+"

Then we can generate summaries using ddply and summarise:

datNew <- ddply(dat,.(District,Gender,Year,Month,AgeGroupNew),summarise,
            TotalDeaths = sum(Total.Deaths))

I was worried at first because I got 91 deaths instead of 104 as you indicated, but I counted by hand and 91 is right I think. A typo, perhaps.

joran
  • 169,992
  • 32
  • 429
  • 468
  • Thanks so much joran. It's working on my side - and I learned a few R things. Thank you again. Sorry for the 104 typo! – OSlOlSO Jul 10 '11 at 09:07