I would like to read in multiple comma-separated years from the dashboard user through textInput, with the ability to enter a range of numbers (e.g. 1910, 1980:1990, 2017). I then need to loop through each of the years in the list and remove them from a data table.

My function is shown below where daily_mean_Q is a data frame and excluded_years is an array c(1910, 1980:1990, 2017) from the user.

remove_years <- function(daily_mean_Q, excluded_years) {

  daily_mean_Q <- daily_mean_Q %>%
    mutate(Year = str_sub(Date, 1, 4))
  for(year in excluded_years) {
    daily_mean_Q <- daily_mean_Q %>%
      filter(Year != as.character(year))
  }

  daily_mean_Q <- daily_mean_Q %>%
    select(-Year)
}
Jon
  • It would be much better to use `filter(daily_mean_Q, !Year %in% excluded_years)`. – r2evans Jun 09 '20 at 23:21
  • BTW, `str_sub(Date,1,4)` is breaching scope of the function: `Date` is neither created within the function nor passed to it as an argument. If there is a `Date` defined in the parent environment (or higher), then R will not complain about `object 'Date' not found`, but it is often bad practice to rely on that behavior, and it renders the function non-reproducible (meaning its output is not defined solely by the inputs). – r2evans Jun 09 '20 at 23:25
  • To see how `%in%` is better (I think) here, see https://stackoverflow.com/q/15358006/3358272 and https://stackoverflow.com/q/42637099/3358272 – r2evans Jun 09 '20 at 23:26
  • @r2evans For your second comment, Date is a column name in the daily_mean_Q data frame and not a global variable. I was not aware of the %in% operator but will incorporate it in my code. Thanks! – Jon Jun 10 '20 at 02:16

2 Answers

This is really a duplicate of Difference between `%in%` and `==`, since you're trying to use equality for a set-membership operation, even if you aren't (yet) trying %in%. (Unless I've completely misinterpreted your question.)

Basic equality of vectors vec1 and vec2 in R works in a few ways:

  • if vec2 (or vec1) is length 1, then each element of vec1 is compared against it, i.e. vec1[1] == vec2, vec1[2] == vec2, and so on:

    1:10 == 3
    #  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    
  • if length(vec1) == length(vec2), then the comparison is element-wise:

    1:10 == c(1, 2, 3, 99, 99, 6, 7, 99, 99, 99)
    #  [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
    
  • if length(vec1) is a whole multiple of length(vec2), then R silently recycles vec2, and this is where most of the confusion and problems occur. This means that

    1:10 == c(3, 2)
    #  [1] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    ### which is effectively
    1:10 == c(3, 2, 3, 2, 3, 2, 3, 2, 3, 2)
    #  [1] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    

    This looks right so far, but only by chance. When we type 1:10 == c(2, 3), we are really saying that the 1st, 3rd, 5th, ... elements of vec1 are 2, and the 2nd, 4th, 6th, ... elements of vec1 are 3. That is rarely what is intended; usually set-membership is meant instead. If == were doing set-membership, then reversing the numbers in vec2 would have no effect ... but it does:

    1:10 == c(2, 3)
    #  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    ### which is effectively
    1:10 == c(2, 3, 2, 3, 2, 3, 2, 3, 2, 3)
    #  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    
  • if length(vec1) is not a whole multiple of length(vec2), the same recycling still occurs, but at least we see a warning:

    1:10 == c(3, 2, 1)
    # Warning in 1:10 == c(3, 2, 1) :
    #   longer object length is not a multiple of shorter object length
    #  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    ### which is effectively
    1:10 == c(3, 2, 1, 3, 2, 1, 3, 2, 1, 3) # uneven recycling
    #  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    

To sum up vector == operations: it is intended (and safe!) to compare vectors of the same length, or to compare against a vector of length 1. Any other combination might not warn or error, but the results are often not what was intended.
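The rule of thumb above can be turned into a small guard. This is only a sketch: `strict_eq` is a hypothetical helper, not something base R or dplyr provides, and it simply refuses the length combinations that would trigger silent recycling.

```r
# strict_eq is a hypothetical helper, not part of base R.
# It allows == only for equal lengths or when one side is length 1.
strict_eq <- function(x, y) {
  ok <- length(x) == length(y) || length(x) == 1L || length(y) == 1L
  if (!ok) {
    stop("lengths differ and neither side is length 1; did you mean %in%?")
  }
  x == y
}

strict_eq(1:10, 3)   # scalar comparison, allowed

# Recycling case: the guard raises an error instead of returning a
# plausible-looking but order-dependent result.
res <- tryCatch(strict_eq(1:10, c(3, 2)),
                error = function(e) conditionMessage(e))
res
```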


When you want to know which elements of vec1 are contained in vec2, you need the %in% operator:

1:10 %in% c(2, 3)
#  [1] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
### order in vec2 is not important
1:10 %in% c(3, 2)
#  [1] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This is effectively asking, for each element in vec1, whether that element is == to any of the elements in vec2. That is our first bullet above: the element is length 1, and vec2 is length 1 or more. A deliberately naive loop demonstrating this:

vec1 <- 1:10
vec2 <- c(2, 3)
for (el in vec1) {          # el is length 1
  print(any(el == vec2))    # works as intended per bullet 1 above
}
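As a cross-check of the loop above, here is a short sketch that builds the same answer two ways and compares each against `%in%`. The `match()` form mirrors how the base R help page for `%in%` describes the operator; the variable names are only illustrative.

```r
vec1 <- 1:10
vec2 <- c(2, 3)

# One length-1 comparison per element of vec1, collected into a logical vector.
by_loop <- vapply(vec1, function(el) any(el == vec2), logical(1))

# match() returns the position of each element of vec1 in vec2,
# or the nomatch value (0 here) when it is absent.
by_match <- match(vec1, vec2, nomatch = 0L) > 0L

identical(by_loop, vec1 %in% vec2)
identical(by_match, vec1 %in% vec2)
```

Both calls to `identical()` should agree with `%in%`, and neither depends on the order of the values in vec2.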

If your excluded_years is truly an integer vector, as in

excluded_years <- c(1957, 1960:1970, 1987)
excluded_years
#  [1] 1957 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1987

(Technically, this vector is numeric, not integer, but we'll ignore that distinction for now.)

Then we can simply filter on it:

library(dplyr)
filter(mtcars, ! cyl %in% c(4, 8))
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

and see that the data no longer contains any rows where cyl is 4 or 8 (cyl only takes the values 4, 6, and 8, so only the 6-cylinder cars remain). With this, you could replace your function with one of:

remove_years <- function(daily_mean_Q, excluded_years) {
  daily_mean_Q %>%
    mutate(Year = as.integer(stringr::str_sub(Date, 1, 4))) %>%
    filter(! Year %in% excluded_years) %>%
    select(-Year)
}
# or, without the temporary Year column:
remove_years <- function(daily_mean_Q, excluded_years) {
  daily_mean_Q %>%
    filter(! as.integer(stringr::str_sub(Date, 1, 4)) %in% excluded_years)
}
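To try the idea without any packages, here is a minimal base R sketch on made-up data. The data frame contents and the name `remove_years_base` are assumptions for illustration; the only thing taken from the question is that Date starts with a four-digit year.

```r
# Toy data (invented for this example); Date is assumed to begin with the year.
daily_mean_Q <- data.frame(
  Date = c("1956-12-31", "1957-01-01", "1965-06-15", "1971-03-02", "1987-10-10"),
  Q    = c(10.2, 11.5, 9.8, 12.1, 8.7),
  stringsAsFactors = FALSE
)
excluded_years <- c(1957, 1960:1970, 1987)

# remove_years_base is a hypothetical base R counterpart of remove_years().
remove_years_base <- function(daily_mean_Q, excluded_years) {
  yr <- as.integer(substr(daily_mean_Q$Date, 1, 4))
  # ! applies to the whole %in% expression; drop = FALSE keeps a data frame
  daily_mean_Q[!yr %in% excluded_years, , drop = FALSE]
}

kept <- remove_years_base(daily_mean_Q, excluded_years)
kept$Date
```

Only the 1956 and 1971 rows survive, since 1957, 1965, and 1987 are all in excluded_years.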

However, if your excluded_years is a string, as shiny fields tend to return, then we have a few options to convert this:

  • we might be tempted to structure it like R language and then eval it ... this works, but opens your app up to "injection" security problems:

    excluded_years <- "1957, 1960:1970, 1987"
    eval(parse(text = paste("c(", excluded_years, ")")))
    #  [1] 1957 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1987
    
    ### PROBLEM
    excluded_years <- "1957, 1960:1970); message('gotcha'); c("
    eval(parse(text = paste("c(", excluded_years, ")")))
    # gotcha
    # NULL
    
  • it is safer to write a small home-grown parser that splits on commas and then on colons, and to make sure the users know the rules:

    excluded_years <- "1957, 1960:1970, 1987"
    strsplit(excluded_years, "[, ]+")
    # [[1]]
    # [1] "1957"      "1960:1970" "1987"     
    unlist(lapply(strsplit(excluded_years, "[, ]+")[[1]],
                  function(a) {
                    # convert explicitly instead of relying on seq() to coerce strings
                    a <- as.integer(strsplit(a, "[: ]+")[[1]])
                    if (length(a) == 1) return(a)
                    if (length(a) == 2) return(seq(a[1], a[2]))
                    stop("unrecognized sequence")
                  }))
    #  [1] 1957 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1987
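The split-and-split-again approach can be wrapped into a reusable function that validates each token before converting it. This is a sketch under stated assumptions: `parse_years` is a hypothetical name, and the accepted grammar (four-digit years or start:end ranges, separated by commas and/or spaces) is my guess at "the rules", not something the question specifies.

```r
# parse_years is a hypothetical helper; the grammar it accepts is an assumption.
parse_years <- function(txt) {
  tokens <- strsplit(trimws(txt), "[, ]+")[[1]]
  tokens <- tokens[nzchar(tokens)]

  # Reject anything that is not "YYYY" or "YYYY:YYYY" before converting.
  bad <- !grepl("^[0-9]{4}(:[0-9]{4})?$", tokens)
  if (any(bad)) {
    stop("cannot parse: ", paste(tokens[bad], collapse = ", "))
  }

  years <- lapply(strsplit(tokens, ":", fixed = TRUE), function(a) {
    a <- as.integer(a)
    if (length(a) == 1L) a else seq(a[1], a[2])
  })
  sort(unique(as.integer(unlist(years))))
}

parse_years("1957, 1960:1970, 1987")

# The injection attempt from above is rejected with an error, never evaluated.
tryCatch(parse_years("1957, 1960:1970); message('gotcha'); c("),
         error = function(e) conditionMessage(e))
```

In a shiny app the result would be passed as excluded_years to remove_years(); catching the error lets the app show a message to the user rather than crash.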
    
r2evans
  • Thanks for the in depth explanation! The last two code examples answered my question which was essentially how to turn "1957, 1960:1970, 1987" into c(1957, 1960:1970, 1987) – Jon Jun 10 '20 at 02:19

Edited: I got carried away by your example. You should use %in% instead of !=.

Although I cannot say much without the data, I think you should get rid of the for loop.

daily_mean_Q <- daily_mean_Q %>%
      filter(!Year %in% as.character(excluded_years))

dplyr::filter can filter out multiple values. See example.

library(gapminder)
library(dplyr)
gapminder %>% 
  filter(!year %in% c(1952, 1957))
#> # A tibble: 1,420 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1962    32.0 10267083      853.
#>  2 Afghanistan Asia       1967    34.0 11537966      836.
#>  3 Afghanistan Asia       1972    36.1 13079460      740.
#>  4 Afghanistan Asia       1977    38.4 14880372      786.
#>  5 Afghanistan Asia       1982    39.9 12881816      978.
#>  6 Afghanistan Asia       1987    40.8 13867957      852.
#>  7 Afghanistan Asia       1992    41.7 16317921      649.
#>  8 Afghanistan Asia       1997    41.8 22227415      635.
#>  9 Afghanistan Asia       2002    42.1 25268405      727.
#> 10 Afghanistan Asia       2007    43.8 31889923      975.
#> # ... with 1,410 more rows
  • `xyz != c(1952, 1957)` is only correct when `xyz` is length 2. Even then I find it arguably not a good idea. – r2evans Jun 09 '20 at 23:22
  • As a demonstration of why this might appear to work here but is structurally wrong, try `filter(year != c(1957, 1952))` (reversed values), see that you get a different return value. – r2evans Jun 09 '20 at 23:30