1

I am quite new to R and I have a problem which I think needs solving by a loop...

I have multiple .xlsx files in AWS S3:

(note the naming of each do not follow a pattern)

  • File name 1: climate_change.xlsx

  • File name 2: farming.xlsx

  • etc etc

Each file is identical in structure but contains very messy data . Once each file is read into R, I have the same code to tidy each file.

What I would like to be able to do:

Within a loop:

  • Read in file 1, apply the tidying code...

  • Read in file 2, apply the tidying code...

  • etc etc

Then once we have looped through each file (reading and tidying) rbind each of the tidied files together.

What I have already:

All files are in the same AWS folder. The code I have to read in a single file from that folder is here:

data <- aws.s3::s3read_using(
                            FUN = readxl::read_xlsx,
                            object = "folder/climate_change.xlsx",
                            bucket = s3_bucket,
                            sheet = "sheet1")

I have some code for tidying which works on each single file.

I also have a .csv lookup which I can read in with each of the file names:

file_name
climate_change.xlsx
farming.xlsx

How would I create the loop?

fe108
  • 161
  • 7
  • If you have a list of the filenames/file paths you could then use `lapply()` to read each of the files at once. – wernor Aug 25 '22 at 10:29

1 Answers1

0

As @wernor suggests, this is the general approach you need, in pseudocode, since you haven't provided a MWE.

library(dplyr)
library(magrittr)

fileList <- myCSVLookup()

allMyData <- lapply(
               fileList,
               function(f) {
                 aws.s3::s3read_using(
                   FUN = readxl::read_xlsx,
                   object = f,
                   bucket = s3_bucket,
                   sheet = "sheet1"
                 ) %>% 
                 myTidyFunction()
               }
            ) %>% 
            bind_rows()
Limey
  • 10,234
  • 2
  • 12
  • 32
  • Thanks both - I am looking for bind rows as the data within each .xlsx is report data on each programme -- same reporting measures for each programme. @Limey - note your point on lack of MWE - thanks for presenting a solution despite this. For this to work, would fileList need to be a list of the .xlsx names? – fe108 Aug 25 '22 at 10:53
  • Oh, okay, my bad for assuming. About `fileList`, it could be an array, it doesn't need to be a list. You probably would need filepaths though if you are working on a different folder. – wernor Aug 25 '22 at 11:03