I have a data set that contains occurrences of events over multiple years, regions, quarters, and types. Sample:

REGION Prov Year Quarter Type Hit Miss
xxx     yy  2008  4     Snow  1   0   
xxx     yy  2009  2     Rain  0   1

I have variables defined to examine the columns of interest:

syno.h <- data$Hit
quarter.number <- data$Quarter
syno.wrng <- data$Type

I want the number of Hits per type and quarter across all of the data. Since each Hit is either 0 or 1, my first attempt was a simple sum() applied through tapply.

tapply(syno.h, list(syno.wrng, quarter.number), sum)

This returned:

              1   2   3   4
ARCO         NA  NA  NA   0
BLSN          0  NA  15  74
BLZD          4  NA  17  54
FZDZ         NA  NA   0   1
FZRA         26   0 143 194
RAIN        106 126 137 124
SNOW         43   2 215 381
SNSQ          0  NA  18  53
WATCHSNSQ    NA  NA  NA   0
WATCHWSTM     0  NA  NA  NA
WCHL         NA  NA  NA   1
WIND         47  38 155 167
WIND-SUETES  27   6  37  56
WIND-WRECK   34  14  44  58
WTSM          0   1   7  18

For some of the types that have no occurrences in a given quarter, tapply returns NA instead of zero. I have checked the data a number of times, and I am confident that it is clean. The values that aren't NA are also correct.

If I check type/quarter combinations, including ones that return NA with tapply, using just sum(), I get the values I expect:

> sum(syno.h[quarter.number==3&syno.wrng=="BLSN"])
[1] 15
> sum(syno.h[quarter.number==1&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="ARCO"])
[1] 0

It seems that my issue is with how I use tapply with sum, and not with the data itself.

Does anyone have any suggestions on what the issue may be?

Thanks in advance

  • `syno.wrng` is `NULL` because of the typo when you defined it. And also, `sum` is not meaningful for factors. A reproducible example would be terrific. – Rich Scriven Aug 16 '16 at 19:29
  • That is just an example of my code, in the actual code syno.wrng is fine. I've checked all of the inputs, and they all have the expected values. It is difficult for me to put up a reproducible example because I can't share the data I am working with. – Paul Greeley Aug 17 '16 at 10:53

1 Answer

I have two potential solutions, depending on exactly what you are looking for. If you are only interested in the number of Hits per Type and Quarter for the combinations that occur in your data, and don't need a record of combinations that never occur, you can get an answer with

aggregate(data[["Hit"]], by =  data[c("Type","Quarter")], FUN = sum)
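As a minimal sketch on made-up data (the `toy` data frame and the `hits` name below are hypothetical, not the asker's data), this is roughly what that call produces: one row per Type/Quarter combination that actually occurs.

```r
# Hypothetical toy data; column names follow the sample in the question
toy <- data.frame(
  Type    = c("SNOW", "SNOW", "RAIN", "RAIN", "BLSN"),
  Quarter = c(1, 1, 1, 2, 4),
  Hit     = c(1, 0, 1, 0, 1)
)

# Sum Hit within each Type/Quarter combination present in the data
hits <- aggregate(toy[["Hit"]], by = toy[c("Type", "Quarter")], FUN = sum)
print(hits)
```

Note that a combination that occurs but has no hits, such as RAIN in quarter 2 here, still appears with a sum of 0. Only combinations with no rows at all, such as BLSN in quarter 1, are absent from the result.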

If it is important to also keep a record of the combinations with no hits, you can use

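dataHit <- data[data[["Hit"]] == 1, ]
# Take the levels from the full data so that types and quarters
# with no hits still appear in the table, with a count of 0
dataHit[["Type"]] <- factor(dataHit[["Type"]], levels = levels(factor(data[["Type"]])))
dataHit[["Quarter"]] <- factor(dataHit[["Quarter"]], levels = levels(factor(data[["Quarter"]])))
table(dataHit[["Type"]], dataHit[["Quarter"]])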
Barker
  • I tried that solution, and it returned the same data as tapply, but any of the types that returned NA were removed. I would still like to understand where those NA values come from in the first place. Why would sum() in tapply, and aggregate differ from using sum() on its own? – Paul Greeley Aug 17 '16 at 12:13
  • `aggregate` looks at all of the combinations that actually exist in your data and returns the sum for each of those. `tapply`, on the other hand, looks at all possible combinations of `syno.wrng` and `quarter.number`, so if a combination doesn't exist in your data (e.g. `ARCO` and `1`), it returns `NA` to indicate that it doesn't exist. – Barker Aug 17 '16 at 15:33
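
To illustrate the explanation above, here is a small sketch on made-up data (the `toy` data frame and the result names are hypothetical). It suggests why the manual check gives 0 while `tapply` gives `NA`: subsetting a combination with no rows yields a zero-length vector, and `sum()` of that is 0, whereas `tapply` does not call `sum` for empty cells and fills them with its `default` value, which is `NA`. Assuming R 3.4.0 or later, where `tapply` has a `default` argument, passing `default = 0` fills those cells with zero instead; `xtabs` is another option.

```r
# Hypothetical toy data; column names follow the sample in the question
toy <- data.frame(
  Type    = c("SNOW", "SNOW", "RAIN", "RAIN", "BLSN"),
  Quarter = c(1, 1, 1, 2, 4),
  Hit     = c(1, 0, 1, 0, 1)
)

# Cells with no rows (e.g. BLSN in quarter 1) come back as NA
with_na <- tapply(toy$Hit, list(toy$Type, toy$Quarter), sum)

# The manual check sums a zero-length vector, which is 0
manual <- sum(toy$Hit[toy$Quarter == 1 & toy$Type == "BLSN"])

# Fill empty cells with 0 instead (assumes R >= 3.4.0 for `default`)
with_zero <- tapply(toy$Hit, list(toy$Type, toy$Quarter), sum, default = 0)

# xtabs sums Hit over each Type/Quarter cell and uses 0 for empty cells
counts <- xtabs(Hit ~ Type + Quarter, data = toy)

print(with_na)
print(with_zero)
```

With either of the last two approaches, a 0 no longer distinguishes "rows exist but none were hits" from "no rows at all", so keep the `NA` version if that difference matters.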