
I have a data frame df with several locations (var1 and var2). For each year, it gives the month in which the temperature at each location reaches its annual maximum (maxmo) or minimum (minmo). For each location, I want to calculate the median month, over the available years, in which the annual maximum or minimum is reached.

df <- data.frame(
        variable = c("var1","var1","var1","var1","var2", "var2","var2","var2"), 
        year = c(2007:2010,2012:2015), 
        maxmo = c(10,8,8,7,7,8,8,8), 
        minmo=c(12,12,1,1,1,1,2,2))

I tried this

library(dplyr)

df %>%
  group_by(variable) %>%
  summarize(maxmo2 = median(maxmo), minmo2 = median(minmo))

which gives me

  variable maxmo2 minmo2
  <chr>     <dbl>  <dbl>
1 var1          8    6.5
2 var2          8    1.5

while I want to get

  variable maxmo2 minmo2
  <chr>     <dbl>  <dbl>
1 var1          8    12.5
2 var2          8    1.5

or

  variable maxmo2 minmo2
  <chr>     <dbl>  <dbl>
1 var1          8    0.5
2 var2          8    1.5

So for var1 the median should fall between December and January (12.5 or 0.5).

Jdh
  • I don't understand the problem. `median(c(12,12,1,1))` should be `6.5`. How do you want to arrive at `12.5`? – dww Apr 17 '23 at 13:30
  • I think the OP is wanting to say that the maximum temperature occurs between December and January (i.e. `12.5`), however, I agree this does not make sense logically. – Hansel Palencia Apr 17 '23 at 13:33
  • I was puzzled at that too, but indeed I think the desired output should be "halfway between December and January" for that one (12.5?) – Andy Baxter Apr 17 '23 at 13:33
  • To do this, you would probably have to build a bespoke function (some type of for loop) that considers the new year and resets the 1 to a 13, in cases where the group-by variable has more than 1 year included in it? – Hansel Palencia Apr 17 '23 at 13:34
  • @HanselPalencia and I have thought similarly! It could work to code a separate category for maxes and mins as "nth month of summer" and "nth month of winter" to get medians, then translate back to month names/numbers? – Andy Baxter Apr 17 '23 at 13:39
  • I'm not sure what this definition would look like in most cases. What if every month had the same number of occurrences? What would the median be? It seems that if you are allowing values to wrap around, there would be no middle. Or what if every month had the same frequency except for October, which had 0 or just 1 less than the others? What happens in that case? How are you defining the median in this case? – MrFlick Apr 17 '23 at 13:53
  • Do you need the answer to be 12.5? Would the equivalent 0.5 (halfway through January) work? In which case you can use `median.month = function (months) Arg(median(exp(conv * months * 1i))) / (2*pi/12) %% 12` – dww Apr 17 '23 at 14:09
  • The question is reopened so I have moved my comments to an answer. – G. Grothendieck Apr 17 '23 at 15:29

1 Answer


Map the months onto a 24-hour clock and use the circular package:

library(circular)
library(dplyr)

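med <- function(x, eps = 0.01) {
  # Months 1..12 become hours 0, 2, ..., 22 on a 24-hour clock.
  cir <- circular(2 * (x - 1), units = "hours")
  # Circular median, reduced to the range [0, 24).
  num <- as.numeric(median(cir)) %% 24
  # Values within eps of 24 get 1 added before the final wrap, so they
  # cannot map to a month just under 13; then convert hours back to months.
  ((num + (num > (24 - eps))) %% 24) / 2 + 1
}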
 
df %>%
  group_by(variable) %>%
  summarize(across(starts_with("m"), med), .groups = "drop")

giving

# A tibble: 2 x 3
  variable maxmo minmo
  <chr>    <dbl> <dbl>
1 var1         8  12.5
2 var2         8   1.5
G. Grothendieck
  • Almost what I'm looking for! I have another case where the median between December (12) and February (2) is taken and with this solution it gives 13 while ideally it would be 1 – Jdh Apr 18 '23 at 11:44
  • It may print as 13 but it cannot be 13, since ...%%24/2+1 cannot be 13 or larger, so if you examine the number I would expect it to be less than 13. You will need to provide something reproducible for anyone to look at it. – G. Grothendieck Apr 18 '23 at 12:14
  • Just in case I have added an eps argument to med. You may need to adjust the value. – G. Grothendieck Apr 18 '23 at 12:53
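
For comparison, the wrap-around idea raised in the comments (treating January after December as month 13) can be sketched in base R without the circular package. This is only an illustration, not the answer's method: `wrap_median_month` is a hypothetical helper name, and it assumes the months are numbers in 1..12 and that one gap between the observed months is clearly the largest. It cuts the year open at that largest gap, takes an ordinary median on the resulting linear scale, and maps the result back to a value in [1, 13).

```r
# Hypothetical helper, not from the thread: median month on a 12-month circle.
# Assumes months are numbers in 1..12 and that one gap between the observed
# months is clearly the largest (otherwise the "middle" is ambiguous).
wrap_median_month <- function(months) {
  u <- sort(unique(months))
  nxt <- c(u[-1], u[1])            # the next distinct month, going around the year
  gaps <- (nxt - u) %% 12          # forward distance from each month to the next
  gaps[gaps == 0] <- 12            # only one distinct month: the gap is a full year
  start <- nxt[which.max(gaps)]    # first month after the largest gap
  shifted <- (months - start) %% 12  # months counted from `start`, no wrap inside the data
  (median(shifted) + start - 1) %% 12 + 1  # back to the 1..12 scale, result in [1, 13)
}

df <- data.frame(
  variable = c("var1", "var1", "var1", "var1", "var2", "var2", "var2", "var2"),
  year = c(2007:2010, 2012:2015),
  maxmo = c(10, 8, 8, 7, 7, 8, 8, 8),
  minmo = c(12, 12, 1, 1, 1, 1, 2, 2))

# Median month per location for the annual minimum and maximum.
min_med <- tapply(df$minmo, df$variable, wrap_median_month)
max_med <- tapply(df$maxmo, df$variable, wrap_median_month)
print(min_med)
print(max_med)

# The December/February case mentioned in the comments.
dec_feb <- wrap_median_month(c(12, 2))
print(dec_feb)
```

With the question's data this sketch reproduces 12.5 and 1.5 for minmo and 8 for maxmo, and it returns 1 for the December/February pair. When several gaps tie for the largest (for example, every month occurring equally often), `which.max` simply picks the first one, so the result is arbitrary in exactly the cases MrFlick describes.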