Using "summarise" (dplyr) with a function requiring a formula

Question

I am trying to generate a table of regression slopes generated by a custom function based on the mblm package (the function in the example here is a simplified version). The function requires a formula as argument and I would like to use dplyr summarise to apply it to grouped samples from a large data frame with many variables. The output should be a tibble of regression slopes for sample groups and response variables that I can pass to a heatmap function.

library (dplyr)

# Example data

test_data <- 
    rbind (
        data.frame(ID=paste0("someName", c(1:9)), Sample_Type="type1", 
           A=seq(1,17, length.out=9),
           I=0.1^seq(1,1.8,length.out=9), 
           J=1-0.1^seq(1,1.8,length.out=9)),
        data.frame(ID=paste0("someName", c(10:15)), Sample_Type="type2", 
           A=seq(1,7, length.out=6), 
           I=0.1^(1-seq(1,1.5,length.out=6)),
           J=1-0.1^(1-seq(1,1.5,length.out=6))))

# Define an independent and the responding variables - I would like to be able to easily test different independent variables
 
idpVar <- "A"
respVar <- test_data %>% .[!names(.) %in% c("ID", "Sample_Type", idpVar)] %>% names()

# Custom function generating numeric value of median slopes (simplified from mblm)

medianSlope <-
function (formula, dataframe) 
{
    if (missing(dataframe)) 
        dataframe <- environment(formula)
    term <- as.character(attr(terms(formula), "variables")[-1])
    x = dataframe[[term[2]]]
    y = dataframe[[term[1]]]
    if (length(term) > 2) {
        stop("Only linear models are accepted")
    }
    xx = sort(x)
    yy = y[order(x)]
    n = length(xx)
    slopes = c()
    smedians = c()
        for (i in 1:n) {
            slopes = c()
            for (j in 1:n) {
                if (xx[j] != xx[i]) {
                  slopes = c(slopes, (yy[j] - yy[i])/(xx[j] - 
                    xx[i]))
                 }
            }
            smedians = c(smedians, median(slopes))
        }
        slope = median(smedians)
    
    slope
}

# Custom function works with test dataframe and a single named dependent variable but "group_by" seems to be ignored:

test_data %>% group_by (Sample_Type) %>% medianSlope( formula(paste("J", "~", idpVar))  ,.)

Leaving the grouping issue aside for the moment, I tried to make "summarise" work by generating a list of multiple formulas:

paste(respVar, "~", idpVar) [1] "B ~ A" "C ~ A" "D ~ A" "E ~ A" "F ~ A" "G ~ A" "H ~ A" "I ~ A" "J ~ A" "K ~ A" "L ~ A"

However

test_data %>% summarise_at (respVar, medianSlope(paste(respVar, "~", idpVar), .))

Error: $ operator is invalid for atomic vectors

test_data %>% summarise_at (respVar, medianSlope(paste(get(respVar), "~", get(idpVar)), .))

Error in get(idpVar) : object 'A' not found

I am relatively new to R and a bit lost. Can you help?

bouncyball · Answer 1 · 2020-08-20T12:48:54.470

0

I am unsure if this will be able to be accomplished using the summarise_at function. However, we can use a combination of map_dbl, by, and some other data cleanup functions to perform the calculations:

library(tidyverse)

# split the data using `by` (acts as a group_by)
# use `map_dbl` to iterate over the variables in respVar
# we use setNames so that the returned vector from map_dbl is named
# then, bind the rows together, convert to data frame
# finally convert row names (groups) to a column
by(test_data, test_data$Sample_Type,
       FUN = function(d) map_dbl(setNames(respVar, respVar), 
                             ~medianSlope(formula(paste(.x, "~", idpVar)), 
                                          data = d))) %>%
    do.call("rbind", .) %>%
    as.data.frame() %>%
    rownames_to_column(var = "Sample_Type")

  Sample_Type            I            J
1       type1 -0.004623987  0.004623987
2       type2  0.341974269 -0.341974269

edited Aug 20 '20 at 12:48

answered Aug 18 '20 at 16:58

bouncyball

10,631
19
31

Bouncyball, thank you for your quick reply. I copied your solution and got it to work with a minor modification: map (respVar, ~medianSlopes(formula(paste(.x, "~", idpVar)), data = test_data), .)
However, the grouping of variables is still ignored. I have edited test_data and example code to make it clearer what I meant to achieve. – ThomasW Aug 19 '20 at 16:05
```{r} test_data %>% group_by (Sample_Type) %>% nest() %>% map (respVar, ~medianSlopes(formula(paste(.x, "~", idpVar)), data = test_data), .) %>% setNames(respVar) %>% bind_rows() #instead of bind_cols() ``` generates: # A tibble: 2 x 2 `1` `2` 1 -0.00572 0.000418 2 0.00572 -0.000418 However, the output I want is # A tibble: 2 x 3 Sample_Type I J type1 -0.000418 0.000418 type2 0.3419743 -0.3419743 – ThomasW Aug 20 '20 at 10:01
Correction: the output with values generated by the test_data should be # A tibble: 2 x 3 Sample_Type I J type1 -0.004623987 0.004623987 type2 0.3419743 -0.3419743 type1 -0.00462 0.00462 type2 0.292 -0.292 – ThomasW Aug 20 '20 at 10:24
@ThomasW thanks for those notes. I have completely updated my answer. I honestly do not know if this can be accomplished using `summarise_at`, so I proposed an alternative solution. I do wonder that if instead of using the `formula` interface in your function, you used `x` and `y` arguments it might be easier? – bouncyball Aug 20 '20 at 12:49

ThomasW · Answer 2 · 2020-10-12T19:13:37.467

Bouncyball, thanks again for your help. It seems indeed that "summarize" and "mutate" cannot call functions that use a formula as input, although I have not seen that explained anywhere else. The workaround was instructive but I followed you alternative suggestion and rewrote the called function. Still a learner, I set myself a challenge to replace "for" loops in the mblm-derived code, and to eliminate what seemed to be redundant calculations (at the expense of a higher demand on RAM- but it still runs much faster with my data on my PC, and I am planning to re-use the dx matrix in the next step of developing the code). The two solutions are below. Cheers, Thomas

mblm_2_short <-        # code adapted from mblm(y ~ x, repeated = T), for calculation of repeat median slope only 
function (x, y) 
{
xx = sort(x)
yy = y[order(x)]
n = length(xx)
slopes = c()
smedians = c()

    for (i in 1:n) {
        slopes = c()
        for (j in 1:n) {
            if (xx[j] != xx[i]) {
              slopes = c(slopes, (yy[j] - yy[i])/(xx[j] - 
                xx[i]))
             }
        }
        smedians = c(smedians, median(slopes))
    }
    slope = median(smedians)
}

.

med_slopesMed <-      # repeat median slope- like mblm(y ~ x, repeated = T), slope only                     
function (xx, yy) 
{
  x = sort(xx)
  y = yy[order(xx)]
  n = length(x)
  
  dx = matrix (rep (0,n^2), ncol=n)
  dy = c()
  z  = matrix (rep (0,n^2), ncol=n)
  
  for (i in 1:(n-1)) {                   ### x-axis distances (dx) and slopes (z) between points
    dxi             = x[-(1:i)]-x[i]
    dx [i, (i+1):n] = dxi                # for points 1:n, x-axis distances to all other points
    dyi             = y[-(1:i)]-y[i]
    zi              = dyi/dxi           
    z [i, (i+1):n]  = zi                 # for points 1:n, linear slopes connecting with all other points
  }
  
  z = replace(z, is.infinite(z), NA)           # removes +/-Inf and NaN generated by dx=0
  z  = t(z)[,-n] + z[,-1]

  median (apply(z, 1, median, na.rm=T)) 
}

Using "summarise" (dplyr) with a function requiring a formula

2 Answers2