R: read in multiple .xlxs files from AWS S3

Question

I am quite new to R and I have a problem which I think needs solving by a loop...

I have multiple .xlsx files in AWS S3:

(note the naming of each do not follow a pattern)

File name 1: climate_change.xlsx
File name 2: farming.xlsx
etc etc

Each file is identical in structure but contains very messy data . Once each file is read into R, I have the same code to tidy each file.

What I would like to be able to do:

Within a loop:

Read in file 1, apply the tidying code...
Read in file 2, apply the tidying code...
etc etc

Then once we have looped through each file (reading and tidying) rbind each of the tidied files together.

What I have already:

All files are in the same AWS folder. The code I have to read in a single file from that folder is here:

data <- aws.s3::s3read_using(
                            FUN = readxl::read_xlsx,
                            object = "folder/climate_change.xlsx",
                            bucket = s3_bucket,
                            sheet = "sheet1")

I have some code for tidying which works on each single file.

I also have a .csv lookup which I can read in with each of the file names:

file_name
climate_change.xlsx
farming.xlsx

How would I create the loop?

If you have a list of the filenames/file paths you could then use `lapply()` to read each of the files at once. — wernor, Aug 25 '22 at 10:29

score 0 · Answer 1 · answered Aug 25 '22 at 10:35

0

As @wernor suggests, this is the general approach you need, in pseudocode, since you haven't provided a MWE.

library(dplyr)
library(magrittr)

fileList <- myCSVLookup()

allMyData <- lapply(
               fileList,
               function(f) {
                 aws.s3::s3read_using(
                   FUN = readxl::read_xlsx,
                   object = f,
                   bucket = s3_bucket,
                   sheet = "sheet1"
                 ) %>% 
                 myTidyFunction()
               }
            ) %>% 
            bind_rows()

answered Aug 25 '22 at 10:35

Limey

10,234
2
12
32

Thanks both - I am looking for bind rows as the data within each .xlsx is report data on each programme -- same reporting measures for each programme. @Limey - note your point on lack of MWE - thanks for presenting a solution despite this. For this to work, would fileList need to be a list of the .xlsx names? – fe108 Aug 25 '22 at 10:53
Oh, okay, my bad for assuming. About `fileList`, it could be an array, it doesn't need to be a list. You probably would need filepaths though if you are working on a different folder. – wernor Aug 25 '22 at 11:03

R: read in multiple .xlxs files from AWS S3

1 Answers1