6

I have a data frame with daily average temperature data in it, structured like so:

 'data.frame':  4666 obs. of  6 variables:
 $ Site : chr  "EB" "FFCE" "IB" "FFCE" ...
 $ Date : Date, format: "2013-01-01" "2013-01-01" "2013-01-01" "2014-01-01" ... 
 $ Day  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Year : int  2013 2013 2013 2014 2014 2014 2014 2015 2015 2015 ...
 $ Month: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Temp : num  28.5 28.3 28.3 27 27.8 ...

i am attempting to produce a summary table which just totals the number of days in a year per site above certain temperatures thresholds e.g 25c, 26c. i can achieve this manually by using dplyr like so-

Days_above = Site_Daily_average %>% 
  group_by(Year, Site) %>% 
  summarise("23" = sum(Temp > 23), "24" = sum(Temp > 24),"25"= sum(Temp > 
25), "26"= sum(Temp > 26),  "27"= sum(Temp > 27), "28"= sum(Temp > 28), "29" 
= sum(Temp > 29),"30"= sum(Temp > 30), "31" = sum(Temp > 31), "ABOVE 
THRESHOLD" = sum(Temp > maxthreshold))%>% as.data.frame()  

Which produces a table like so :

   Year Site  23  24  25  26  27  28  29 30 31 ABOVE THRESHOLD
1  2012   EB 142 142 142  91  64  22   0  0  0               0
2  2012 FFCE 238 238 238 210 119  64   0  0  0               0
3  2012   IB 238 238 238 218 138  87   1  0  0               0
4  2013   EB 115 115 115 115 115 109  44  0  0               0
5  2013 FFCE 223 223 216 197 148 114  94  0  0               0
6  2013   IB 365 365 365 348 299 194 135  3  0               0

...

however, as you can see the code is fairly verbose. The problem i am having is producing this same output for a sequence of temperature thresholds, i.e Tempclasses = Seq(16,32,0.25).

As you can see that would take a long time to type that out manually. i feel like this is a very simple calculation and there should be way to use dplyr to recognize each variable in the sequence vector, perform this function and produce an output in a complete table format. sorry if that was unclear as i am relatively new to R, any suggestions would be welcome, thankyou.

K.west
  • 63
  • 3

2 Answers2

3

Here's a tidyverse approach, likewise using mtcars for illustration:

library(tidyverse)

mtcars %>% 
  mutate(threshold = cut(mpg, 
                         breaks=seq(10, max(mtcars$mpg)+10, 5), 
                         labels=seq(10, max(mtcars$mpg)+5, 5))) %>% 
  group_by(cyl, threshold) %>% 
  tally %>% 
  ungroup %>% 
  complete(threshold, nesting(cyl), fill=list(n=0)) %>% 
  arrange(desc(threshold)) %>% 
  group_by(cyl) %>% 
  mutate(N_above = cumsum(n)) %>% 
  select(-n) %>% 
  arrange(cyl, threshold)
   threshold cyl N_above
1         10   4      11
2         15   4      11
3         20   4      11
4         25   4       6
5         30   4       4
6         35   4       0
7         10   6       7
8         15   6       7
9         20   6       3
10        25   6       0
11        30   6       0
12        35   6       0
13        10   8      14
14        15   8       8
15        20   8       0
16        25   8       0
17        30   8       0
18        35   8       0

If you want the final data in wide format, add a spread at the end and remove the arrange:

... %>%
select(-n) %>% 
spread(threshold, N_above)
  cyl 10 15 20 25 30 35
1   4 11 11 11  6  4  0
2   6  7  7  3  0  0  0
3   8 14  8  0  0  0  0
eipi10
  • 91,525
  • 24
  • 209
  • 285
  • oh, i tried using spread earlier but wasnt arranging it properly apparently, thanks very much, thats awesome – K.west May 22 '18 at 03:39
  • I just realized I have the cumsum reversed when compared to your example. Do you want the number of days above, rather than below, the threshold? I can update my answer, if desired. – eipi10 May 22 '18 at 03:42
  • preferably above if thats ok – K.west May 22 '18 at 03:43
1

As @dww commented we can use cut to get the required format. I have tried this on mtcars dataset where we create range from 10 to 35 with step of 5 for mpg column.

df <- mtcars
df$group <- cut(df$mpg, seq(10, 35, 5))

and then we group by cyl and use table to get count of how many of them fall in the the respective buckets.

table(df$cyl, df$group)

#  (10,15] (15,20] (20,25] (25,30] (30,35]
#4       0       0       5       2       4
#6       0       4       3       0       0
#8       6       8       0       0       0

Now , if certain value is greater than 10, it is also greater than 15, hence the number in (15, 20) bucket should also include number from (10,15) bucket and number in (20, 15) bucket should include both the previous number. Hence, we need a row-wise cumsum for this table

t(apply(table(df$cyl, df$group), 1, cumsum))

#   (10,15] (15,20] (20,25] (25,30] (30,35]
# 4       0       0       5       7      11
# 6       0       4       7       7       7
# 8       6      14      14      14      14

For your case , the code would go

Site_Daily_average$group <- cut(Site_Daily_average$Temp, seq(16,32,0.25))

#and then do table to get required answer.
t(apply(table(Site_Daily_average$Year,Site_Daily_average$Site, 
              Site_Daily_average$group), 1, cumsum)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • oh ok, so this provides counts of temperature in temperature classes determined by the sequence? what i was trying to achieve is to tally the total number of observations above each individual value in that sequence rather than a count per "temperature class", does that make sense? im not sure im explaining it clearly – K.west May 22 '18 at 02:27
  • ohh..wait so `sum(Temp > 24)` would also have values from `sum(Temp > 23)` and so on. The number would keep on growing incrementally across each row? – Ronak Shah May 22 '18 at 02:30
  • not quite, i think whats confusing is that sum( Temp > 24 ) is a slightly deceptive formula because (from what i understand) it is a logical test that provides the Sum of the _number_ of data points which are TRUE to that expression (i.e. over 24 c), rather than actually giving you a sum, so it effectively gives you a count of logical True data points? does that make sense? i had problems using the 'count' and 'Length' for some reason. it it makes more sense i essentiallly need the 'Count' of all temperatures above the values in the sequence / – K.west May 22 '18 at 02:43
  • yes, that is right. So any number which is greater than 24 is also greater than 23. So `sum(Temp > 24)` would always be greater than or equal to `sum(Temp > 23)`. I have edited the answer, see if it makes sense. – Ronak Shah May 22 '18 at 02:50