How to extract the name of a column from a data frame to be used in the loop?

Question

I would like to copy the text of a data frame's column names one-by-one in a for loop. My code seems to return NULL values from the column name argument.

More broadly, I want to create a summary by factor of each of several columns.

# Create an example data frame
df <- data.frame( c( "a", "b", "c", "b", "c"), c( 6, 4, 10, 9, 11), c( 1, 3, 5, 3, 6))

colnames(df) <- c( "Group", "Num.Hats", "Num.Balls")

example data frame with each group member's number of hats and number of balls

Now I want to loop over columns two and three, creating a data object that stores the summary statistics by Group. The point is to see how groups a, b, and c differ from one another with respect to balls and with respect to hats.

My code looks like this:

# Evaluate stats of each group
for (i in 2:3){
    assign(paste0("Eval.", colnames(df[[i]])), tapply(df[,i], df$Group, summary))
}

I am getting a single object called "Eval." with the summary statistics for Num.Balls. To be clear, I would like two objects, one called Eval.Num.Hats and one called Eval.Num.Balls.

If colnames() cannot be used in this way, is there another function to achieve my desired result? Alternatively, I'd be open to another solution if the loop is not required.

I think you are looking to do a group-by summarize. – YOLO Jan 21 '20 at 20:09

score 2 · Accepted Answer · answered Jan 21 '20 at 20:09

df[[i]] extracts the column as a plain vector, and a vector has no column names, so colnames(df[[i]]) returns NULL. We can either use colnames(df[i]), which keeps a one-column data frame, or, more directly, colnames(df)[i]:

for (i in 2:3){
    assign(paste0("Eval.", colnames(df)[i]), tapply(df[,i], df$Group, summary))
}

-output

Eval.Num.Hats
#$a
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      6       6       6       6       6       6 

#$b
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.00    5.25    6.50    6.50    7.75    9.00 

#$c
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  10.00   10.25   10.50   10.50   10.75   11.00 

Eval.Num.Balls
#$a
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      1       1       1       1       1       1 

#$b
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      3       3       3       3       3       3 

#$c
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   5.00    5.25    5.50    5.50    5.75    6.00
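
To make the difference between the indexing forms concrete, here is a small sketch using the question's example data (the variable names such as name_from_vector are only for illustration). It also shows why the original loop produced an object called "Eval.":

```r
# Same example data as in the question
df <- data.frame(Group = c("a", "b", "c", "b", "c"),
                 Num.Hats = c(6, 4, 10, 9, 11),
                 Num.Balls = c(1, 3, 5, 3, 6))

i <- 2

# [[ ]] returns the bare column vector, which carries no column names
name_from_vector <- colnames(df[[i]])   # NULL

# [ ] returns a one-column data frame, so the name survives
name_from_subset <- colnames(df[i])     # "Num.Hats"

# Indexing the vector of names avoids subsetting the data at all
name_from_names <- colnames(df)[i]      # "Num.Hats"

# paste0() treats a NULL argument as an empty string,
# which is why both iterations wrote to an object named "Eval."
bad_name <- paste0("Eval.", name_from_vector)    # "Eval."
good_name <- paste0("Eval.", name_from_names)    # "Eval.Num.Hats"
```

Because both iterations of the original loop assigned to the same name, the second one (Num.Balls) overwrote the first.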

M-- · Answer 2 · 2020-01-21T20:46:39.137

You can avoid a for-loop altogether.

Explanation:

Here, using lapply, I am looping over the names of all columns to be summarized, except the first one, which is used for grouping (see what names(df1)[-1] returns).

The with() function evaluates an expression inside the data frame, so you don't need to write dataframe$column and can simply type the column name.

by(data, grouping variable, function) is used to apply summary by group.

We need the column itself, not its name as a character string. That's why I am using mget(), which looks up the variable whose name is stored in the string.

smry.list.df1 <- lapply(names(df1)[-1], function(col) with(df1, by(mget(col), Group, summary)))
names(smry.list.df1) <- paste0("Eval.", names(df1)[-1]) #setting the names as you've shown

smry.list.df1

#> $Eval.Num.Hats
#> Group: a
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       6       6       6       6       6       6 
#> -------------------------------------------------------- 
#> Group: b
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    4.00    5.25    6.50    6.50    7.75    9.00 
#> -------------------------------------------------------- 
#> Group: c
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   10.00   10.25   10.50   10.50   10.75   11.00 
#> 
#> $Eval.Num.Balls
#> Group: a
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       1       1       1       1       1       1 
#> -------------------------------------------------------- 
#> Group: b
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       3       3       3       3       3       3 
#> -------------------------------------------------------- 
#> Group: c
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    5.00    5.25    5.50    5.50    5.75    6.00

If you want them to be saved as separate objects (not recommended) you can use list2env:

list2env(smry.list.df1, globalenv())

Data:

df1 <- data.frame(Group = c( "a", "b", "c", "b", "c"), 
                  Num.Hats = c( 6, 4, 10, 9, 11), 
                  Num.Balls = c( 1, 3, 5, 3, 6))
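
As a rough illustration of the pieces used above, the sketch below shows what function(col) receives on each call and what mget() returns inside with(). The last two lines are a variant that indexes the columns directly instead of using mget(); it is an alternative for comparison, not the method of this answer, and the names seen, looked_up and smry are made up for the example.

```r
df1 <- data.frame(Group = c("a", "b", "c", "b", "c"),
                  Num.Hats = c(6, 4, 10, 9, 11),
                  Num.Balls = c(1, 3, 5, 3, 6))

# function(col) is an anonymous function; lapply() calls it once per element,
# so col is "Num.Hats" on the first call and "Num.Balls" on the second
seen <- lapply(names(df1)[-1], function(col) col)

# Inside with(df1, ...), the columns are visible as variables, and mget()
# returns a named list holding the variable whose name is given as a string
looked_up <- with(df1, mget("Num.Hats"))

# Variant without mget(): lapply() over the columns themselves,
# then tapply() applies summary() within each Group
smry <- lapply(df1[-1], function(x) tapply(x, df1$Group, summary))
names(smry) <- paste0("Eval.", names(smry))
```

Individual results can then be pulled out of the list by name, for example smry[["Eval.Num.Hats"]], without creating separate objects in the global environment.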

Very helpful. Can you explain what function(col) does? Also, does col in mget(col) function as a sort of nonspecific, universal placeholder for data frame columns? — bhbennett3, Jan 22 '20 at 21:29

Jonathan V. Solórzano · Answer 3 · 2020-01-22T20:04:18.547

Here is another solution without any explicit loops, using dplyr, tidyr and broom.

library(dplyr)  # %>%, group_by(), do()
library(tidyr)  # pivot_longer()
library(broom)  # tidy()

df %>%
  #Change from wide to long format
  pivot_longer(cols = c("Num.Hats","Num.Balls"),
               names_to = "Var") %>%
  #group by Group (a,b,c) and Var (Num.Hats, Num.Balls)
  group_by(Group, Var) %>%
  #Calculate the summary function for each group
  do(tidy(summary(.$value)))

# A tibble: 6 x 8
# Groups:   Group, Var [6]
#  Group Var    minimum    q1 median  mean    q3 maximum
#  <fct> <chr>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
#1 a     Num.B~       1  1       1     1    1          1
#2 a     Num.H~       6  6       6     6    6          6
#3 b     Num.B~       3  3       3     3    3          3
#4 b     Num.H~       4  5.25    6.5   6.5  7.75       9
#5 c     Num.B~       5  5.25    5.5   5.5  5.75       6
#6 c     Num.H~      10 10.2    10.5  10.5 10.8       11
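
For comparison, a similar one-table-of-summaries result can be sketched in base R with aggregate(), which needs no extra packages. This is a swapped-in technique rather than the approach of this answer, and the name res is only for illustration. Each summarized column comes back as a matrix column with one row per group.

```r
# Same example data as in the question
df <- data.frame(Group = c("a", "b", "c", "b", "c"),
                 Num.Hats = c(6, 4, 10, 9, 11),
                 Num.Balls = c(1, 3, 5, 3, 6))

# cbind() on the left-hand side summarizes both columns at once, per Group
res <- aggregate(cbind(Num.Hats, Num.Balls) ~ Group, data = df, FUN = summary)

# res$Num.Hats and res$Num.Balls are matrices whose columns are
# Min., 1st Qu., Median, Mean, 3rd Qu., Max.
hats_mean_b <- res$Num.Hats[res$Group == "b", "Mean"]
```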

edited Jan 22 '20 at 20:04

answered Jan 21 '20 at 21:27

I really like how clean this code is. It makes sense to me to transpose the column names as variables. However, I am getting the following error: install.packages("tidy") Warning in install.packages : package ‘tidy’ is not available (for R version 3.6.1) – bhbennett3 Jan 22 '20 at 19:57

Oh, sorry. I missed an "r" at the end of `tidyr`. I already corrected the error on the post. It should work now. – Jonathan V. Solórzano Jan 22 '20 at 20:05
