-3

Please see below a simplified data set, which is on a country-year basis:

country <- c("CountryA", "CountryA", "CountryA", "CountryA",
"CountryB", "CountryB", "CountryB", "CountryB",
"CountryC", "CountryC", "CountryC", "CountryC")

year <- c(2001, 2002, 2003, 2004,
2001, 2002, 2003, 2004,
2001, 2002, 2003, 2004)

v1 <- c(2, 3, 5, 4, 3, 3, 1, 2, 1, 4, 3, 2)

df1 <- data.frame(country, year, v1)

df1

 country   year   v1 
CountryA   2001    2
CountryA   2002    3
CountryA   2003    5
CountryA   2004    4
CountryB   2001    3
CountryB   2002    3
CountryB   2003    1
CountryB   2004    2
CountryC   2001    1
CountryC   2002    4
CountryC   2003    3
CountryC   2004    2

My question is:

How can I write a code that creates an incident-based subset of the above data set like the one below:

cntry <- c("CountryA", "CountryB", "CountryC")
stYear <- c(2001, 2002, 2003)
endYear <- c(2003, 2004, 2003)
v1Max <- c(5, 3, 3)
v1Ave <- c(3.33, 2, 3)

df2 <- data.frame(cntry, year, v1)

df2

   cntry   stYear   endYear   v1Max   v1Ave
CountryA     2001      2003       5    3.33
CountryB     2002      2004       3       2   
CountryC     2003      2003       3       3

In other words, I need to code each incident separately into a new data frame. (For example, the first line in df2 above is the incident in CountryA from 2001 to 2003.) While doing this, I need to also recode the values within the corresponding time frame. (For example, v1Max in df2 is the maximum value v1 takes in df1 for the duration of the incident. Similarly, v1Ave in df2 is the average.)

If you can provide me with a code that performs the above transformation from df1 to df2, I can then enhance it to solve my problem.

Thanks!

neutral
  • 107
  • 4
  • 13

1 Answers1

0

Your title implies some kind of data cleanup or subsetting, but what you're after here is aggregation or summary. To do that with the data provided, dplyr is a great option:

library(dplyr)
df2 <- group_by(df1, country) %>% 
       summarize(start = min(year), end = max(year), v1Max = max(v1), v1Avg = mean(v1))

# A tibble: 3 x 5
#   country start   end v1Max v1Avg
#    <fctr> <dbl> <dbl> <dbl> <dbl>
#1 CountryA  2001  2004     5  3.50
#2 CountryB  2001  2004     3  2.25
#3 CountryC  2001  2004     4  2.50
jdobres
  • 11,339
  • 1
  • 17
  • 37
  • Thank you. But your code returns the same start and end years for all rows. – neutral Aug 09 '16 at 02:21
  • That's because the data set you provided as an example has the same start and end year for each value of "country". If you removed the 2004 entry for CountryA, for example, the summary would only run until 2003. – jdobres Aug 09 '16 at 02:23
  • I understand. But I want to create a subset, as in df2 above. I do not want the subset to include all years in df1. – neutral Aug 09 '16 at 02:29
  • It seems you want very specific behavior here, pulling out different years per country. To do something like that, you need to subset the data before using the above code, with conditions such as `subset(df, (country == 'CountryA' & year >= 2001 & year < 2004) | (country == 'CountryB' & year >= 2002 & year < 2005) | (country == 'CountryC' & year == 2003))` – jdobres Aug 09 '16 at 02:39