I have a dataframe and a number of conditions. Each condition is supposed to check whether the value in a certain column of the dataframe is within a set of valid values.

This is what I tried:

# create the sample dataframe
age <- c(120, 45)
sex <- c("x", "f")

df <- data.frame(age, sex)

# create the sample conditions
conditions <- list(
  list("age", c(18:100)),
  list("sex", c("f", "m"))
)

addIndicator <- function (df, columnName, validValues) {
  indicator <- vector()

  for (row in df[, toString(columnName)]) {
    # for some strange reason, %in% doesn't work correctly here, but always returns FALSE
    indicator <- append(indicator, row %in% validValues)
  }
  df <- cbind(df, indicator)

  # rename the column
  names(df)[length(names(df))] <- paste0("I_", columnName)

  return(df)
}

for (condition in conditions){
  columnName <- condition[1]
  validValues <- condition[2]
  df <- addIndicator(df, columnName, validValues)
}

print(df)

However, every condition is reported as not met, which is not what I expect:

  age sex I_age I_sex
1 120   x FALSE FALSE
2  45   f FALSE FALSE

I figured that %in% does not return the expected result. I checked typeof(row) and tried to boil this down to a minimal example. In a simple minimal example, with the same types and values, %in% works properly. So something must be wrong in the context where I apply it. Since this is my first attempt to write anything in R, I am stuck here.

What am I doing wrong and how can I achieve what I want?

Jonathan Scholbach
    When you set `validValues` to `condition[2]`, your result is a list and not a vector; but you probably intended to feed your function with a vector. To extract the column values needed, try `validValues <- condition[[2]]` instead in your `for` loop. In addition, there is likely an easier or streamlined approach to establishing these indicators if interested... – Ben Jun 02 '20 at 12:34

3 Answers

conditions appears to be a nested list. When you use:

validValues <- condition[2]

in your for loop, the result is still a list (of length one), not the vector it contains.

To get the vector of values to use with %in%, extract the element itself with [[:

validValues <- condition[[2]]
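
The difference can be checked directly at the console. This is a minimal sketch using the conditions list from the question; the names as_list and as_vector are only illustrative and do not appear in the original code.

```r
conditions <- list(
  list("age", 18:100),
  list("sex", c("f", "m"))
)
condition <- conditions[[1]]

as_list   <- condition[2]    # single bracket: still a list of length 1
as_vector <- condition[[2]]  # double bracket: the integer vector 18:100

is.list(as_list)    # TRUE
is.list(as_vector)  # FALSE

# The list holds the whole vector as one element, so 45 is not found in it
45 %in% as_list     # FALSE
45 %in% as_vector   # TRUE
```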

A simplified approach to obtaining indicators could be with a simple list:

conditions_lst <- list(age = 18:100, sex = c("f", "m"))

And using sapply instead of a for loop:

cbind(df, sapply(setNames(names(df), paste("I", names(df), sep = "_")), function(x) {
  df[[x]] %in% conditions_lst[[x]]
}))

Output

  age sex I_age I_sex
1 120   x FALSE FALSE
2  45   f  TRUE  TRUE
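
For comparison, here is a minimal sketch of the question's own loop with the [[ fix applied. It also assumes the row-by-row loop can be replaced by a single vectorised %in% call on the whole column, which is a simplification and not something the original code did.

```r
df <- data.frame(age = c(120, 45), sex = c("x", "f"))

conditions <- list(
  list("age", 18:100),
  list("sex", c("f", "m"))
)

# %in% is vectorised, so the whole column can be tested in one call
addIndicator <- function(df, columnName, validValues) {
  df[[paste0("I_", columnName)]] <- df[[columnName]] %in% validValues
  df
}

for (condition in conditions) {
  # [[ extracts the element itself rather than a one-element list
  df <- addIndicator(df, condition[[1]], condition[[2]])
}

print(df)
```

With the sample data, only the second row should be flagged TRUE in both indicator columns.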
Ben

If you prefer an approach that uses the tidyverse family of packages:

library(tidyverse)

allowed_values <- list(age = 18:100, sex = c("f", "m"))

df %>%
  imap_dfr(~ .x %in% allowed_values[[.y]]) %>%
  rename_with(~ paste0('I_', .x)) %>%
  bind_cols(df)

imap_dfr allows you to manipulate each column in df using a lambda function. .x references the column content and .y references the name.

rename_with renames the columns using another lambda function and bind_cols combines the results with the original dataframe.
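
To see what .x and .y refer to, a small sketch may help. It assumes the purrr package is installed; the data frame toy and the result name described are hypothetical and only used for illustration.

```r
library(purrr)

# Hypothetical toy data frame, only for illustration
toy <- data.frame(age = c(120, 45), sex = c("x", "f"))

# A data frame is a list of columns:
# .x is the column's contents, .y is the column's name
described <- imap_chr(toy, ~ paste0(.y, " has ", length(.x), " values"))

print(described)
```

The result is a character vector with one entry per column, named after the columns.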

I borrowed the simplified list of conditions from Ben's answer. I find my approach slightly more readable, but that is a matter of taste and of whether you are already using the tidyverse elsewhere.

severin

An alternative approach using across and cur_column() (and leaning heavily on severin's solution):

library(tidyverse)

df <- tibble(age = c(12, 45), sex = c('f', 'f'))
allowed_values <- list(age = 18:100, sex = c("f", "m"))

df %>%
  mutate(across(c(age, sex),
                list(valid = ~ .x %in% allowed_values[[cur_column()]])))

Reference: https://dplyr.tidyverse.org/articles/colwise.html#current-column

Related question: Refering to column names inside dplyr's across()

s_pike