I have multiple, distinct data sets with different column names, but the functions to be applied to them are similar. To reduce code duplication, I thought I could create a second data set that maps each column name to the function to be applied to it. So there are two pieces:

  • raw data (whose column positions can change, so we rely on column headers)
  • dataframe with column headers and corresponding function to be applied
### The raw data set

library(dplyr)  # also re-exports tibble() and %>%
df1 <- tibble(A = c(NA, 1, 2, 3), B = c(1, 2, 1, NA),
              C = c(NA, NA, NA, 2), D = c(2, 3, NA, 1), E = c(NA, NA, NA, 1))

# A tibble: 4 x 5
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1    NA     1    NA     2    NA
2     1     2    NA     3    NA
3     2     1    NA    NA    NA
4     3    NA     2     1     1

### The dataframe containing functions

funcDf <- tibble(colNames = names(df1), type = c(rep("Compulsory", 4), "Conditional"))
funcDf$func <- c("is.na()", "is.na()", "is.na()", "is.na()", 
"ifelse(!is.na(D) & is.na(E), 0, ifelse(!is.na(D) & !is.na(E), 1, 0))")

# A tibble: 5 x 3
  colNames type        func                                                             
  <chr>    <chr>       <chr>                                                            
1 A        Compulsory  is.na()                                                          
2 B        Compulsory  is.na()                                                          
3 C        Compulsory  is.na()                                                          
4 D        Compulsory  is.na()                                                          
5 E        Conditional ifelse(!is.na(D) & is.na(E), 0, ifelse(!is.na(D) & !is.na(E), 1,~


I am able to get a simple sum running, like so:

df1 %>% summarise_at(.vars = funcDf$colNames, .funs = list(~sum(., na.rm = TRUE)))

But I cannot work out how to apply each function recorded in funcDf to its corresponding column.

Any guidance, please :)

Edit

I expect to have the following output as a result of applying the function:

# A tibble: 1 x 5
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     3     1     2

@YinYan, thanks so much for indulging me. Following up on my comment: what if I need output of the following shape, with grouping as in the code below?

df1 %>% group_by(A, B) %>% summarise_all(.funs = list(~sum(., na.rm = TRUE)))

# A tibble: 4 x 5
# Groups:   A [4]
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     0     3     0
2     2     1     0     0     0
3     3    NA     2     1     1
4    NA     1     0     2     0

info_seekeR

1 Answer

I changed the func column so that it holds functions instead of strings. Because the function for column E always refers to columns of df1, I wrapped its body in with().

funcDf$func <- c(
    function(x) is.na(x),
    function(x) is.na(x),
    function(x) is.na(x),
    function(x) is.na(x),
    function(x) with(data = df1, data.frame(E = ifelse(!is.na(D) & is.na(E), 0, ifelse(!is.na(D) & !is.na(E), 1, 0))))
)

library(purrr)
# For each column, look up its function in funcDf and apply it to that column.
result <- map_dfc(funcDf$colNames, function(colName) {
    colFunc <- dplyr::pull(funcDf[funcDf$colNames == colName, "func"])[[1]]
    data.frame(colFunc(df1[, colName]))
})
> result
      A     B     C     D E
1  TRUE FALSE  TRUE FALSE 0
2 FALSE FALSE  TRUE FALSE 0
3 FALSE FALSE  TRUE  TRUE 0
4 FALSE  TRUE FALSE FALSE 1

To get the final result:

> summarise_all(result,sum)
  A B C D E
1 1 1 3 1 1
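
If the rules need to stay as strings, as in the question's funcDf, one possible alternative is to evaluate them with base R's parse() and eval(), using the data frame as the evaluation environment. The sketch below is only an illustration under stated assumptions: each rule is rewritten as a complete expression that names its column ("is.na(A)" rather than "is.na()"), a plain data.frame stands in for the tibble, and rule_exprs and apply_rules are hypothetical names that do not appear in the question.

```r
# Base R only; a plain data.frame stands in for the tibble from the question.
df1 <- data.frame(A = c(NA, 1, 2, 3), B = c(1, 2, 1, NA),
                  C = c(NA, NA, NA, 2), D = c(2, 3, NA, 1),
                  E = c(NA, NA, NA, 1))

# Assumption: each rule is a complete expression naming its columns,
# e.g. "is.na(A)" instead of the question's "is.na()".
rule_exprs <- c(
  A = "is.na(A)",
  B = "is.na(B)",
  C = "is.na(C)",
  D = "is.na(D)",
  E = "ifelse(!is.na(D) & is.na(E), 0, ifelse(!is.na(D) & !is.na(E), 1, 0))"
)

# Hypothetical helper: evaluate every rule with the data frame's columns in
# scope, then sum each resulting vector.
apply_rules <- function(df, exprs) {
  vapply(exprs, function(txt) {
    as.numeric(sum(eval(parse(text = txt), envir = df)))
  }, numeric(1))
}

totals <- apply_rules(df1, rule_exprs)
print(totals)
```

For this df1 the totals are 1, 1, 3, 1, 1 for A to E, the same numbers as summarise_all(result, sum) above. Note that eval() runs whatever code the strings contain, so this only makes sense when the rule table comes from a trusted source.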

Answer based on new question

I had to modify the func column, because this time the function for column E depends on a different data frame for each group. First use group_split() to split the original data frame into a list of data frames. You can then use a for loop or a map function to repeat the process for each one. I personally prefer map functions because the code is more concise.

funcDf$func <- c(
    function(x,...) is.na(x),
    function(x,...) is.na(x),
    function(x,...) is.na(x),
    function(x,...) is.na(x),
    function(x,df) with(data = df, data.frame(E = ifelse(!is.na(D) & is.na(E), 0, ifelse(!is.na(D) & !is.na(E), 1, 0))))
)
df_list <- df1 %>% group_by(A, B) %>% group_split()
map_dfr(df_list, function(parent_df){
    map_dfc(funcDf$colNames,function(colName){
        colFunc <- dplyr::pull(funcDf[funcDf$colNames == colName,"func"])[[1]]
        data.frame(colFunc(parent_df[,colName],df = parent_df))
    }) %>%
        summarise_all(sum)
})
  A B C D E
1 0 0 1 0 0
2 0 0 1 1 0
3 0 1 0 0 1
4 1 0 1 0 0
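
The same idea of handing each group's rows to the rule can also be sketched in base R, without group_split() or map functions. This is a rough illustration, not a replacement for the code above: flag_na, rules, group_key and per_group are hypothetical names, a plain data.frame stands in for the tibble, and the groups are keyed by a pasted string so that rows with NA in A or B still form their own group.

```r
# Base R only; a plain data.frame stands in for the tibble from the question.
df1 <- data.frame(A = c(NA, 1, 2, 3), B = c(1, 2, 1, NA),
                  C = c(NA, NA, NA, 2), D = c(2, 3, NA, 1),
                  E = c(NA, NA, NA, 1))

# Hypothetical rule list: every rule receives the current group's rows.
flag_na <- function(col) function(df) is.na(df[[col]])
rules <- list(
  A = flag_na("A"), B = flag_na("B"), C = flag_na("C"), D = flag_na("D"),
  E = function(df) ifelse(!is.na(df$D) & is.na(df$E), 0,
                          ifelse(!is.na(df$D) & !is.na(df$E), 1, 0))
)

# paste() turns NA into the text "NA", so those rows keep a group of their own.
group_key <- paste(df1$A, df1$B, sep = "_")
groups <- split(df1, group_key)

# One row per group, one column per rule: apply each rule and sum its result.
per_group <- t(vapply(groups, function(g) {
  vapply(rules, function(rule) as.numeric(sum(rule(g))), numeric(1))
}, numeric(length(rules))))
print(per_group)
```

The rows of per_group are named by the pasted key (for example "1_2" for A = 1, B = 2), so a group's counts can be looked up by name rather than by position.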
yusuzech
  • Thank you! Apologies, I think I did not explain the output of my function clearly enough. I was expecting to get a vector of TRUE/FALSE values, so that I could summarise them (or even summarise them during the function application). I added the expected output to my question – info_seekeR Oct 15 '19 at 16:29
  • I have modified my answer; it should now provide the desired output. The way the function for column E is created may look strange, but it should be fine, since it always references the same columns in the same data frame. – yusuzech Oct 15 '19 at 16:39
  • Just wondering if there is a way to take grouping into account (as I need to apply the functions by groups)... trying to figure out a tidyeval way to do the entire bit as well, but given how you seem fluent with this, thought would check with you again :) – info_seekeR Oct 15 '19 at 20:48
  • I'm not sure what you mean exactly. Can you update your question and provide new example data to work with? If it is a very different question, please open a new post instead. – yusuzech Oct 15 '19 at 21:16
  • I just added the detail to my question – info_seekeR Oct 15 '19 at 21:25
  • Thank you ever so much! I am digesting how you achieved this. I was trying to do this myself using map but kept getting lists (which is understandable, but originally I had thought this could be achieved using summarise). Thank you again! – info_seekeR Oct 15 '19 at 22:18