0

I'm using R and I have data as follow:

 data<-data.frame(time=c(20230301000000,20230301000010, 20230301000020, 20230301000030,20230301000040,
                        20230301000050, 20230301000100 , 20230301000110 , 20230301000120, 20230301000130,
                        20230301000140, 20230301000150, 20230301000200, 20230301000210, 20230301000220,
                        20230301000230, 20230301000240, 20230301000250,20230301000300),
                 switch=c(40,41,42,43,0,0,0,51,52,53,0,0,0,55,56,57,52,0,0),
                 NH4=c(2,2,3,3,3,5,4,9,9,9,10,11,12,4,4,5,5,7,8))  

Here, if switch >0 , then it means that the switch is on. Otherwise, if switch =0, then it means that the switch is off. switch always has nonnegative values. I want to calulate the time duration of on and off of switch in seconds. time is written in year, month, day, hour, minute, second order. Within each on/ off time duration, I want to calculate the minimum, averge , and maximum value of NH4. For the switch on case, I will denote it as on_NH4_min, on_NH4_avg, and on_NH4_max. For the switch off case, I will denote it as off_NH4_min, off_NH4_avg, off_NH4_max. So my desired output should looks like this:

summary_data <-data.frame(on_NH4_min=c(2,9,4),
                          on_NH4_avg=c(2.5,9,4,5),
                          on_NH4_max=c(3,9,5),
                          off_NH4_min=c(3,10,7),
                          off_NH4_avg=c(4,11,7,5),
                          off_NH4_max=c(5,12,8),
                          on_time=c(30,20,30),
                          off_time=c(20,20,10))  

Chat GPT gave me an answer, but it didn't work for me. Here is the code from Chat GPT.

library(dplyr)
library(tidyr)

# Create a variable that indicates whether the switch is on or off
data <- data %>% 
  mutate(switch_on = ifelse(switch > 0, 1, 0))

# Calculate the time duration of each on/off interval
data <- data %>% 
  mutate(interval = cumsum(switch_on != lag(switch_on, default = 0))) %>% 
  group_by(interval, switch_on) %>% 
  mutate(start_time = first(time),
         end_time = last(time),
         duration = difftime(end_time, start_time, units = "secs")) %>% 
  ungroup() %>% 
  select(-interval)

# Calculate the NH4 statistics for each on/off interval
summary_data <- data %>% 
  group_by(switch_on) %>% 
  summarise(NH4_min = min(NH4),
            NH4_avg = mean(NH4),
            NH4_max = max(NH4),
            time = sum(duration)) %>% 
  pivot_wider(names_from = switch_on,
              values_from = c(NH4_min, NH4_avg, NH4_max, time),
              names_prefix = c("on_", "off_"))

I want to get summary_data.

Lee
  • 369
  • 1
  • 6

1 Answers1

2

Using dplyr and tidyr:

library(dplyr)
library(tidyr)
data %>%
  group_by(r = consecutive_id(switch > 0)) %>%
  summarize(
    switch = if_else(switch[1] > 0, "on", "off"), 
    time = diff(range(time)), 
    across(NH4, list(min = ~ min(.), avg = ~ max(.), max = ~ max(.)))
  ) %>%
  mutate(grp = cumsum(switch == "on")) %>%
  select(-r) %>%
  pivot_wider(
    id_cols = grp, names_from = switch, 
    values_from = c(time, NH4_min, NH4_avg, NH4_max), 
    names_glue = "{switch}_{.value}") %>%
  select(-grp)
# # A tibble: 3 × 8
#   on_time off_time on_NH4_min off_NH4_min on_NH4_avg off_NH4_avg on_NH4_max off_NH4_max
#     <dbl>    <dbl>      <dbl>       <dbl>      <dbl>       <dbl>      <dbl>       <dbl>
# 1      30       60          2           3          3           5          3           5
# 2      20       60          9          10          9          12          9          12
# 3      30       50          4           7          5           8          5           8

In using across(NH4, ..), if your real data has more than just ammonium (I'm inferring) and you want to summarize them too, you can just add them, as in across(c(NH4, CO2), ..).

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • I tried your code, but it gave me an error like this: ```Error in app$vspace(new_style$`margin-top` %||% 0) : attempt to apply non-function``` I think I installed the packages, but I don't know why. – Lee Apr 25 '23 at 05:01
  • I'm using R-4.2.3, with `dplyr_1.1.0` and `tidyr_1.3.0`. Try reinstalling `cli` (ref: https://stackoverflow.com/questions/71857631/any-error-found-in-r-responds-with-error-in-appvspacenew-stylemargin-top#comment128503605_71857631, you may need to restart R before updating `cli`), but that error has little to do with the code above. – r2evans Apr 25 '23 at 05:04
  • For clarity, try adding `%>% as.data.frame()` after the `select(-grp)`. If that makes the error vanish, it confirms (to me) that it has nothing to do with this code. – r2evans Apr 25 '23 at 05:06
  • I tried that, but still get a same error message. MAybe I have the problem from the packages. – Lee Apr 25 '23 at 05:08
  • I reinstalled the `cli` package, then and tried your code again. It gave this error. ```Caused by error in `consecutive_id()``` Maybe it seems like the system in r can't find the function `consecutive_id` – Lee Apr 25 '23 at 05:20
  • what version of dplyr do you have? I mis-spoke above, I'm on `dplyr_1.1.1`, and in 1.1.0 the `consecutive_id` function was added. In a pinch, you can either use `data.table::rleid`, or you can fashion your own using `my_rleid <- function(z) { r <- rle(z); rep.int(seq_along(r$lengths), times = r$lengths); }` – r2evans Apr 25 '23 at 05:28
  • My version for `dplyr` is 1.0.7. I will try to update the version and see if the code works. – Lee Apr 25 '23 at 05:32
  • 1
    I reinstalled the pakage `dplyr` version 1.1.2 and your code worked for me! Thank you. – Lee Apr 25 '23 at 05:37