
I have a folder containing a bunch of CSV files that are titled "yob1980", "yob1981", "yob1982" etc.

I have to use a for loop to go through each file and put its contents into a data frame - the columns in the data frame should be "1980", "1981", "1982" etc

Here is what I have:

file_list <- list.files()

temp = list.files(pattern="*.txt")
babynames <- do.call(rbind,lapply(temp,read.csv, FALSE))

names(babynames) <- c("Name", "Gender", "Count")

I feel like I need a for loop, but I'm not sure how to loop through the files. Can anyone point me in the right direction?

jrbedard
krypticlol
  • Are CSV files one column files with no headers? And do they correspond to same record ids? – Parfait Oct 15 '16 at 20:18
  • What you have already performs a loop through all the files (`lapply` executes an implicit `for` loop across all the files). And you are already producing a single data frame (`do.call(rbind, ....)`). What is the question? – Michael Griffiths Oct 15 '16 at 20:57
  • @Parfait the CSV files have no headers and there are three columns inside which contain a name, gender, and count of that name – krypticlol Oct 15 '16 at 21:21
  • @MichaelGriffiths I'm trying to add a column to the dataframe that includes the year that the name corresponds to. – krypticlol Oct 15 '16 at 21:21
  • What is `file_list` for? – Rich Scriven Oct 15 '16 at 21:55

4 Answers


My favourite way to do this is using ldply from the plyr package. It has the advantage of returning a dataframe, so you don't need to do the rbind step afterwards:

library( plyr )
babynames <- ldply( .data = list.files(pattern="*.txt"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count") )

As an added benefit, you can multi-thread the import very easily, making importing large multi-file datasets quite a bit faster:

library( plyr )
library( doMC )
registerDoMC( cores = 4 )
babynames <- ldply( .data = list.files(pattern="*.txt"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count"),
                    .parallel = TRUE )

To include a Year column in the resulting data frame, first write a function that reads one file and adds the column, then pass that function to ldply in the same way you would pass read.csv:

readFun <- function( filename ) {

    # read in the data
    data <- read.csv( filename, 
                      header = FALSE, 
                      col.names = c( "Name", "Gender", "Count" ) )

    # add a "Year" column by removing both "yob" and ".txt" from file name
    # (the dot is escaped so it matches a literal "." rather than any character)
    data$Year <- gsub( "yob|\\.txt", "", filename )

    return( data )
}

# execute that function across all files, outputting a data frame
doMC::registerDoMC( cores = 4 )
babynames <- plyr::ldply( .data = list.files(pattern="*.txt"),
                          .fun = readFun,
                          .parallel = TRUE )

This will give you your data in a concise and tidy (long) format, which is how I'd recommend moving forward from here. While it is possible to then separate each year's data into its own column, it's likely not the best way to go.
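
For reference, here is a minimal sketch of what that one-column-per-year layout would look like, using base R's reshape(). The small `long` data frame is invented illustration data standing in for the combined result, not real counts:

```r
# Hypothetical stand-in for the combined long-format data (values are invented)
long <- data.frame(
  Name   = c("Mary", "John", "Mary", "John"),
  Gender = c("F", "M", "F", "M"),
  Count  = c(10L, 8L, 12L, 9L),
  Year   = c(1980L, 1980L, 1981L, 1981L)
)

# One row per Name/Gender pair, one Count column per year
wide <- reshape(long,
                idvar     = c("Name", "Gender"),
                timevar   = "Year",
                direction = "wide")

names(wide)
```

The year columns come out named Count.1980, Count.1981 and so on. A name that is absent from a given year gets NA in that column, which is one reason the long format is usually easier to work with.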

Note: gsub returns character strings, so the Year column will be of class character. Depending on your preference, it may be a good idea to convert it to, say, integer class. But that's up to you.
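
A minimal sketch of that conversion, assuming Year holds digit-only strings such as "1980"; the two-row `example_df` is invented purely for illustration:

```r
# Hypothetical two-row example with Year stored as character
example_df <- data.frame(Name = c("Mary", "John"),
                         Year = c("1980", "1981"),
                         stringsAsFactors = FALSE)

# Convert the character years to integers in place
example_df$Year <- as.integer(example_df$Year)

str(example_df$Year)
```

If any value contains non-digit characters (for example a leftover ".txt"), as.integer() returns NA for it with a warning, which makes a bad extraction easy to spot.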

rosscova
  • This way makes a dataframe instead of a list - I was having trouble converting Michael's method from a list to a dataframe. However how would I go about adding the years in a new column to my dataframe? Sort of like appending in python – krypticlol Oct 16 '16 at 19:55
  • Did you include the last line `rbind` of @Michael Griffiths' method? That should do the conversion to a data frame. – rosscova Oct 16 '16 at 20:33
  • What you're asking for doesn't sound like an `append`, rather a new column for each file. For most datasets, that's not a good idea. Are your `name` and `gender` columns the same for every file? – rosscova Oct 16 '16 at 20:38
  • `doMC` didn't work for me but `doFuture::registerDoFuture(); future::plan("multisession", workers = 8)` did. – radek Mar 02 '22 at 15:48

Using purrr

library(tidyverse)

files <- list.files(path = "./data/", pattern = "*.csv")

df <- files %>% 
    map(function(x) {
        read.csv(paste0("./data/", x))
    }) %>%
    reduce(rbind)
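
To carry the year along with this approach, one option is to name each list element by its year and let dplyr::bind_rows() turn those names into a column. This is only a sketch under assumptions: the files are headerless and named like yob1980.txt, and the two files below are written to a temporary directory with invented rows so the example runs on its own:

```r
library(purrr)
library(dplyr)

# Create two tiny example files in a temporary directory (invented data)
data_dir <- file.path(tempdir(), "names_example")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("Mary,F,10", "John,M,8"), file.path(data_dir, "yob1980.txt"))
writeLines(c("Mary,F,12", "John,M,9"), file.path(data_dir, "yob1981.txt"))

# full.names = TRUE returns usable paths, so no paste0() is needed
paths <- list.files(data_dir, pattern = "^yob[0-9]{4}\\.txt$", full.names = TRUE)

# Pull the four-digit year out of each file name
years <- sub("^yob([0-9]{4})\\.txt$", "\\1", basename(paths))

babynames <- paths %>%
  set_names(years) %>%                      # name each path by its year
  map(read.csv, header = FALSE,
      col.names = c("Name", "Gender", "Count")) %>%
  bind_rows(.id = "Year")                   # list names become the Year column
```

Here Year comes back as a character column placed first; it can be converted with as.integer() if numeric years are preferred.
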
Icaro Bombonato

A for loop might be more appropriate than lapply in this case.

file_list = list.files(pattern="*.txt")
data_list <- vector("list", length = length(file_list))

for (i in seq_along(file_list)) {
    filename = file_list[[i]]

    # Read data in
    df <- read.csv(filename, header = FALSE, col.names = c("Name", "Gender", "Count"))

    # Extract year from filename
    year = gsub("yob", "", filename)
    df[["Filename"]] = year

    # Add year to data_list
    data_list[[i]] <- df
}

babynames <- do.call(rbind, data_list)
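
One caveat: gsub("yob", "", filename) strips only the prefix, so the stored value still ends in ".txt". A minimal sketch of pulling out just the year, assuming file names of the form yob1980.txt (year_from_filename is a hypothetical helper name, not part of the code above):

```r
# Hypothetical helper: "yob1980.txt" -> 1980L
year_from_filename <- function(filename) {
  # Drop any directory part and the file extension, e.g. "yob1980"
  stem <- tools::file_path_sans_ext(basename(filename))
  # Remove the leading "yob" and convert what is left to an integer
  as.integer(sub("^yob", "", stem))
}

year_from_filename(c("yob1980.txt", "data/yob1981.txt"))
```

Inside the loop, the result could be assigned to the new column in place of the gsub() output.
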
Michael Griffiths
  • I changed # Extract year from filename year = gsub("yob", "", filename) df[["Filename"]] = year to manually increment the year because the .txt was trailing but thank you for the help! – krypticlol Oct 16 '16 at 01:36

Consider an anonymous function within an lapply():

files = list.files(pattern="*.txt")

dfList <- lapply(files, function(i) {
     df <- read.csv(i, header=FALSE, col.names=c("Name", "Gender", "Count"))
     df$Year <- gsub("yob|\\.txt", "", i)  # strip both the prefix and the extension
     return(df)
})

finaldf <- do.call(rbind, dfList)
Parfait