-1

Having trouble getting either tidyr, reshape, or reshape2 to do something I imagine to be simple.

Have data that looks like this:

    agecat  year    Drug                    total
    <1      1999    Amikacin                12
    <1      1999    Cefepime                3
    <1      1999    Ceftazidime             13
    <1      2000    Amikacin                3
    <1      2000    Cefepime                6
    <1      2000    Ceftazidime             3
    <1      2000    Ciprofloxacin           4
    1-5     1999    Amikacin                37
    1-5     1999    Cefepime                25
    1-5     1999    Ceftazidime             38
    1-5     1999    Ciprofloxacin           38
    1-5     2000    Amikacin                52
    1-5     2000    Cefepime                34
    1-5     2000    Ceftazidime             45

..but much longer, there are 4 agecats, 14 years (1999-2012), and 10 drugs, each with a unique total. I want to spread 'year' and maintain total, basically to have a wideform 'timeline' of trends in agecat by drug. I.e.

agecat      drug       1999   2000   
<1          Amikacin    23      3    
<1          Cefepime     3      6

I've had great luck with spread in tidyr when there's only two grouping variables, but adding the third seems to throw it all off...I either get a sparse matrix or a names matching error. With melt/dcast, I've tried adding an ID column (1:n rows) but still no go...and I'm having trouble understanding what function to apply when I don't want to change the data, just make it wideform. What am I missing? Thanks!

jlev514
  • 281
  • 5
  • 15
  • Are you looking for `reshape2::dcast(df, agecat + Drug ~ year, value.var = "total")`? – CL. Aug 21 '15 at 08:01
  • Hi @user2706569, thanks, I thought so but for some reason the 'value.var' argument doesn't work and it says "Aggregation function missing: defaulting to length", meaning each entry is a 1 or 2. – jlev514 Aug 21 '15 at 08:12
  • In the data you showed, there is only one observation for each agecat per year and drug. In your dataset this seems not to be the case. Therefore `reshape2` asks you how to aggregate the values for the agecat-drug-year observations. – CL. Aug 21 '15 at 08:15
  • There is only one observation for each comb. of agecat + year + drug. I think the problem is that the totals sometimes repeat, but putting a unique ID (1:n rows) and running `> agem<-dcast(age, agecat + Drug + ID ~ year, value.var = 'total')` results in a sparse matrix. – jlev514 Aug 21 '15 at 08:22
  • Update: reforming the original table to group by 'year' and then 'agecat' allows the first reshape function `reshape2::dcast(df, agecat + Drug ~ year, value.var = "total")` to work! Why?! – jlev514 Aug 21 '15 at 08:26
  • Please execute the code from the answer I posted. You will see that it works smoothly as long as agecat-drug-year is unique. The only explanation for you getting different results is different input data. – CL. Aug 21 '15 at 08:27

2 Answers2

1

you can achieve your goal with spread of tidyr:

library(tidyr)
spread(df, year, total)

  agecat          Drug 1999 2000
1     <1      Amikacin   12    3
2     <1      Cefepime    3    6
3     <1   Ceftazidime   13    3
4     <1 Ciprofloxacin   NA    4
5    1-5      Amikacin   37   52
6    1-5      Cefepime   25   34
7    1-5   Ceftazidime   38   45
8    1-5 Ciprofloxacin   38   NA
SabDeM
  • 7,050
  • 2
  • 25
  • 38
0

Data from the question:

df <- structure(list(agecat = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("<1", "1-5"), class = "factor"), 
    year = c(1999L, 1999L, 1999L, 2000L, 2000L, 2000L, 2000L, 
    1999L, 1999L, 1999L, 1999L, 2000L, 2000L, 2000L), Drug = structure(c(1L, 
    2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), .Label = c("Amikacin", 
    "Cefepime", "Ceftazidime", "Ciprofloxacin"), class = "factor"), 
    total = c(12L, 3L, 13L, 3L, 6L, 3L, 4L, 37L, 25L, 38L, 38L, 
    52L, 34L, 45L)), .Names = c("agecat", "year", "Drug", "total"
), class = "data.frame", row.names = c(NA, -14L))

reshape2::dcast provides an easy way to make the data set "wider" such that the years form columns containing the value from total:

dcast(df, agecat + Drug ~ year, value.var = "total")

#   agecat          Drug 1999 2000
# 1     <1      Amikacin   12    3
# 2     <1      Cefepime    3    6
# 3     <1   Ceftazidime   13    3
# 4     <1 Ciprofloxacin   NA    4
# 5    1-5      Amikacin   37   52
# 6    1-5      Cefepime   25   34
# 7    1-5   Ceftazidime   38   45
# 8    1-5 Ciprofloxacin   38   NA

However, according to the comments, the actual data set is a little bit different. It seems like it has more than one observation for an agecat and a drug in a given year. Therefore, there is more than one value of total for each agecat-drug-year combination. As these values need to be aggregated somehow, reshape2 complains:

Aggregation function missing: defaulting to length

The solution is to aggregate the values to get the sum of total for a agecat and a drug in a given year. This is done by setting fun.aggregate = sum:

dcast(df, agecat + Drug ~ year, value.var = "total", fun.aggregate = sum)
CL.
  • 14,577
  • 5
  • 46
  • 73