Select variables from list of tibbles using names in each tibble

Question

I have a list of tibbles or data frames (each one with a name in the list) and I want to: (1) create new columns with the same name long.col in each tibble from variables with different names; (2) do this by matching from another tibble with the key linking tibble name and column name in each tibble and finally; (3) binding rows for all the newly created columns with the same name long.col in each tibble as well as identifying the original tibble where they came from.

I would like to do this preferably using tidyverse functions. Here is a an example of both: a) the list of tibbles; b) the key tibble identifying tibble names and variables to select in each one

df1 <- tibble(v1 = c(rep("A", 5), rep("B", 5)),
          v2 = 1:10)
df2 <- tibble(v1 = c(rep("C", 6), rep("D", 6)),
          v3 = 11:22)
df3 <- tibble(v1 = c(rep("E", 4), rep("F", 4)),
          v4 = 23:30)

list.df <- list(df1, df2, df3)
names(list.df) <- c("data1", "data2", "data3")

key <- tibble(data = c("data1", "data2", "data3"),
          vars = c("v2", "v3", "v4"))

The final output should look like:

 final.df <- tibble(data = c(rep("data1", 10), rep("data2", 12), rep("data3", 8)),
               long.col = 1:30)

I need to do this in a much longer list using multiple columns, so it is not feasible to do it separately for each column in each tibble.

markus · Accepted Answer · 2018-11-25T10:57:48.990

3

You can use map2 here

library(purrr)
library(tibble)
out <- map2_df(.x = list.df,
               .y = names(list.df),
               .f = ~ {
                 temp <- key[["vars"]][key[['data']] == .y]
                 tibble(data = .y, long.col = .x[[temp]])
                 })

Check the output

identical(final.df, out)
#[1] TRUE

edited Nov 25 '18 at 10:57

answered Nov 25 '18 at 07:45

markus

25,843
5
39
58

I actually found your 1st answer (with `pmap()`) even more elegant (because more compact and not requiring the assignment of temporary objects). May I ask you why decided to use `map2()` instead? Thank you – prosoitos Nov 25 '18 at 18:47
1

@prosoitos Thanks for your comment. I changed the answer because with `pmap` - as you probably know - we iterated over `list.df`, `names(list.df)` and `key[['vars']]` in parallel. We got the correct output because the elements in `key[['vars']]` are in the correct order. ... – markus Nov 25 '18 at 18:58
1

Thank you! I suspected that you had changed it to make it more general, but I wasn't sure. I was about to post an answer using purrr yesterday when you posted yours, but mine was not nearly as elegant and involved creating a function first. So both your answers were very educative for me :) – prosoitos Nov 25 '18 at 19:01
... if we were to change the order of the rows in `key` we'd get an error: "Call `rlang::last_error()` to see a backtrace". Best – markus Nov 25 '18 at 19:20

score 0 · Answer 2 · answered Nov 25 '18 at 08:06

The first step in the problem you have stated is to dynamically pick the variable name from the key table, and assign the value in the respective dataframe. This can be done by first defining a function which returns the base variable name (for long.col) based on the data frame.

getBaseVar <- function(dfName, keyTibble){
  varToBeTransformed <- keyTibble %>% dplyr::filter(data == (!!dfName)) %>% 
                       dplyr::select(vars) %>% dplyr::first() %>% 
                         rlang::sym(.)
  return(varToBeTransformed)

}

This function returns the required variable name as a symbol. This can then be passed to the dplyr::mutate function. In order to do this dynamically for all dataframes present in the list, we pass the named list of dataframes to the purrr::imap function, through which both the list element (dataframe in your case) and the name of the list element can be accessed.

list.df.transformed <- purrr::imap(list.df, function(df, name){
   df %>% dplyr::mutate( long.col := !!getBaseVar(name, key))
})

Finally, the output you desire can be created by passing this list of transformed dataframes through purrr::imap once again, and extracting the necessary data (i.e. the long.col column and the variable with a repeating value of the name of the dataframe). After extraction, passing it to the dplyr::bind_rows function returns the desired data frame.

final.df <- tibble(data = character(), long.col = numeric()) 
purrr::imap(list.df.transformed, function(df, name){
    repeatedNameCol <- tibble(data = rep(name, nrow(df)))
    dataToBind <- df %>% dplyr::select(long.col) %>% 
                     dplyr::bind_cols(repeatedNameCol)
    return(dataToBind)
}) %>% dplyr::bind_rows(.) -> final.df

Hope this helps!

Select variables from list of tibbles using names in each tibble

2 Answers2