purrr or loop over a table to automate the row-binding process

Question

I have the following table:

df <- tribble(~name,                                               ~label,     ~year,    ~id,
              "base1.dta", "Generated biographical information",                1990,  "gbi",
              "base2.dta", "Generated biographical information",                1991,  "gbi",
              "base3.dta", "Generated biographical information",                1992,  "gbi",
              "base4.dta", "Generated biographical information",                1993,  "gbi",
              "base5.dta", "Data on children from household questionnaire",     1990, "dchq",
              "base6.dta", "Data on children from household questionnaire",     1991, "dchq",
              "base7.dta", "Data on children from household questionnaire",     1992, "dchq",
              "base8.dta", "Data on children from household questionnaire",     1993, "dchq",
              "base9.dta", "Data from individual questionnaires",                1990,  "diq",
              "base10.dta", "Data from individual questionnaires",               1991,  "diq",
              "base11.dta", "Data from individual questionnaires",               1992,  "diq",
              "base12.dta", "Data from individual questionnaires",               1993,  "diq")

The data files listed in the name column are all stored in the same directory of my project, under exactly the file names given in df. I want to loop or use purrr over this table (which is, of course, much longer) as follows: for all rows that share the same value in the label column, read the files given in the name column, bind_rows all of those data frames, and assign the result to a data frame named after the id. Then I want to save those objects, named by id, as .rds files in a different path.

It'd be helpful if you can provide example data and expected output. Doesn't need to be your actual data, even just a few toy data frames plus output will make it easier to help you. — andrew_reece, Jul 31 '21 at 16:51
What's the logic behind the different path? Is that path also saved or listed anywhere in your data, or is it just some other arbitrary path? — AnilGoyal, Aug 01 '21 at 08:11

andrew_reece · Accepted Answer · 2021-08-01T00:22:07.720

Given that your label and id column both repeat in the same pattern, and you want the output to be labeled by id, you can ignore label.

You also don't need purrr - just group by id and name, read in your data, and then bind rows with summarise.

Using @Serkan's data_test with an id column added.

library(tidyverse)

data_test %>% 
  group_by(id, name) %>% 
  summarise(df = list(read.csv(name))) %>% 
  summarise(joined = list(bind_rows(df)))

  id    joined        
  <chr> <list>        
1 iri   <df [300 × 5]>
2 mtc   <df [64 × 11]>

To write to Rds, you can group by id and then write_rds.

... %>% 
  group_by(id) %>% 
  # joined is a list column, so pull out the data frame itself before writing
  group_walk(~write_rds(.x$joined[[1]], paste0(.y$id, ".rds")))
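
For the "different path" part of the question, one possible sketch follows. It assumes the input files sit in one folder and the .rds files should go to another; in_dir, out_dir, and the toy CSV files written under tempdir() are hypothetical stand-ins, not paths taken from the question.

```r
library(dplyr)
library(purrr)
library(readr)
library(tibble)

# Hypothetical folders: in_dir holds the raw files, out_dir is the "different path"
in_dir  <- file.path(tempdir(), "input")
out_dir <- file.path(tempdir(), "output")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

# Toy input files so the example is self-contained
write.csv(mtcars[1:16, ],  file.path(in_dir, "mtcars_1.csv"), row.names = FALSE)
write.csv(mtcars[17:32, ], file.path(in_dir, "mtcars_2.csv"), row.names = FALSE)
write.csv(iris[1:75, ],    file.path(in_dir, "iris_1.csv"),   row.names = FALSE)
write.csv(iris[76:150, ],  file.path(in_dir, "iris_2.csv"),   row.names = FALSE)

data_test <- tribble(
  ~name,          ~label,   ~id,
  "mtcars_1.csv", "mtcars", "mtc",
  "mtcars_2.csv", "mtcars", "mtc",
  "iris_1.csv",   "iris",   "iri",
  "iris_2.csv",   "iris",   "iri"
)

# One row per id: read every file belonging to that id and bind the rows
joined <- data_test %>%
  group_by(id) %>%
  summarise(joined = list(bind_rows(map(file.path(in_dir, name), read.csv))))

# Write each bound data frame to <out_dir>/<id>.rds
walk2(
  joined$joined, joined$id,
  ~ write_rds(.x, file.path(out_dir, paste0(.y, ".rds")))
)
```

Building the file name with file.path() keeps the output folder in one variable, so pointing the results somewhere else only means changing out_dir.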

Data

data_test <- tribble(
  ~name, ~label, ~id,
  "mtcars_1.csv", "mtcars", "mtc",
  "mtcars_2.csv", "mtcars", "mtc",
  "iris_1.csv", "iris", "iri",
  "iris_2.csv", "iris", "iri"
)

I think this is the best answer, as it uses the id. Nonetheless, I cannot figure out how to save each element of the resulting data_test in a different path. My original question finishes by saying: Then, I want to save those objects named by id as .rds in a different path. — Paula, Jul 31 '21 at 21:56

Serkan · Answer 2 · 2021-07-31T17:52:57.617

I replicated your data.frame and saved mtcars and iris twice. To automate the process, you could start by splitting your data.frame by label, which I assume is what you want to bind_rows on.

Then I use a nested map to read the files listed in your data.frame called df (data_test in my example) with read.table.

Clearly, you can use any data-loading function instead.

library(tidyverse)

data_test <- tribble(
  ~name, ~label,
  "mtcars_1.csv", "mtcars",
  "mtcars_2.csv", "mtcars",
  "iris_1.csv", "iris",
  "iris_2.csv", "iris"
)

data_test %>%
  split(f = .$label) %>%
  map(.f = function(x) {
    # read every file listed for this label, then bind them into one data frame
    x$name %>%
      map(.f = function(path) read.table(path)) %>%
      reduce(bind_rows)
  })

This loads every data.frame listed in the name column, grouped by label, and binds the rows within each group.

Edit: As @Anoushiravan pointed out, you can replace read.table with haven::read_dta(x) to load data from Stata!
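
As a rough sketch of that swap, the same split-and-bind idea can be applied to .dta files, here split by id so the list names match the names wanted in the question. The dta_dir folder and the toy files written with haven::write_dta are assumptions made only so the example runs on its own; the table is a shortened copy of the one in the question.

```r
library(dplyr)
library(purrr)
library(tibble)
library(haven)

# Hypothetical folder holding toy Stata files
dta_dir <- file.path(tempdir(), "dta")
dir.create(dta_dir, showWarnings = FALSE)
write_dta(mtcars[1:16, ],  file.path(dta_dir, "base1.dta"))
write_dta(mtcars[17:32, ], file.path(dta_dir, "base2.dta"))
write_dta(cars[1:25, ],    file.path(dta_dir, "base5.dta"))
write_dta(cars[26:50, ],   file.path(dta_dir, "base6.dta"))

df <- tribble(
  ~name,       ~label,                                          ~year, ~id,
  "base1.dta", "Generated biographical information",            1990,  "gbi",
  "base2.dta", "Generated biographical information",            1991,  "gbi",
  "base5.dta", "Data on children from household questionnaire", 1990,  "dchq",
  "base6.dta", "Data on children from household questionnaire", 1991,  "dchq"
)

# Split the table by id, then read and bind all files within each piece;
# the result is a named list with one data frame per id
by_id <- df %>%
  split(.$id) %>%
  map(~ bind_rows(map(file.path(dta_dir, .x$name), read_dta)))
```

Each element can then be accessed by name, for example by_id$gbi, or saved in a loop over names(by_id).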

Unfortunately, we do not always have the chance to choose in which format we want to receive the data we have to work with. — Paula, Jul 31 '21 at 21:58
@Paula - Is the addition of saving after `bind_rows` new? And did you find a solution to it? I did not notice until now. I can update my answer accordingly if need be! :-) — Serkan, Aug 02 '21 at 14:35
