1

I want to use manual inputs to a QAQC 'log file' to update an existing dataframe. The following log file would indicate date ranges (bounded by datetime_min and datetime_max) for particular variable (or 'all' of them) observations to be omitted from the dateframe (set to NA).

library(tidyverse)
library(lubridate)

QC_log <- tibble(
  variable = c("SpCond", "pH", "pH", "all"),
  datetime_min = ymd_hms(c("2021-06-01 18:00:00","2021-07-19 18:00:00","2021-08-19 18:00:00","2021-11-23 18:00:00")),
  datetime_max = ymd_hms(c("2021-06-02 18:00:00","2021-07-25 21:00:00","2021-08-19 20:00:00","2021-11-26 05:00:00"))
)

The log should modify the following example of a dataframe, removing observations for each variable (for now I am not worried about 'all') that fall between the date min/max.

df <- tibble(
  Datetime = ymd_hms(c("2021-06-01 17:00:00","2021-06-01 18:00:00","2021-06-01 19:00:00","2021-11-23 16:00:00","2021-11-23 17:00:00","2021-11-23 18:00:00")),
  SpCond = c(220,225,224,230,231,235),
  pH = c(7.8,7.9,8.0,7.7,7.8,7.7)
)

I have tried pmap like this:

 df%>%
{pmap(QC_log, mutate(., ..1 = ifelse(Datetime > ..2 & Datetime < ..3, "NA", ..1)))}

I assumed pmap was taking ..1,2,3 from QC_log where ..1 is 'variable', ..2 is datetime_min, and ..3 is datetime_max, passing those as arguments into mutate one QC_log row at a time, which then conditionally replaces observations with NA if they fall into the specified date range.

I think I am having a hard time understanding ideas about non-standard evaluation/how arguments get passed through functions, among other things. Hopefully this is simple for now - I would like for this functionality to eventually be more complicated (e.g., changing all observations to NA when variable = 'all'; adding in separate actions like adding a data flag rather than omitting; or using a specific criterion (e.g., "<10") to omit observations rather than a daterange.

1 Answers1

1

You can do the following

inner_join(
  df %>% pivot_longer(cols=c("SpCond","pH")),
  QC_log,
  by=c("name" = "variable")
  ) %>%
  filter((Datetime<datetime_min) | (Datetime>datetime_max)) %>% 
  select(Datetime, name, value) %>%
  distinct() %>% 
  pivot_wider(id_cols = Datetime)

Output

  Datetime            SpCond    pH
  <dttm>               <dbl> <dbl>
1 2021-06-01 17:00:00    220   7.8
2 2021-06-01 18:00:00     NA   7.9
3 2021-06-01 19:00:00     NA   8  
4 2021-11-23 16:00:00    230   7.7
5 2021-11-23 17:00:00    231   7.8
6 2021-11-23 18:00:00    235   7.7

And here is a data.table approach

dcast(
  unique(melt(setDT(df), id="Datetime")[setDT(QC_log),on=.(variable),allow.cartesian=T,nomatch=0] %>% 
  .[(Datetime<datetime_min) | (Datetime>datetime_max), .(Datetime,variable,value)]),
  Datetime~variable, value.var="value"
)
langtang
  • 22,248
  • 1
  • 12
  • 27