Regex and file processing

Question

This question relates to R but really isn't language specific per se. I have a bunch of csv files with this general format "sitename_03082015.csv". The files have 5 columns and various rows

Host    MaximumIn   MaximumOut  AverageIn   AverageOut
device1 30.63 Kbps  0 bps       24.60 Kbps  0 bps
device2 1.13 Mbps   24.89 Kbps  21.76 Kbps  461 bps
device5 698.44 Kbps 37.71 Kbps  17.49 Kbps  3.37 Kbps

I ultimately want to read in all the files and merge which I can do but during the merge I want to read the site name and date and add it to each related line so the output looks like this

Host      MaximumIn     MaximumOut  AverageIn   AverageOut  Site Name   Date
device1   30.63 Kbps    0 bps       24.60 Kbps  0 bps       SiteA       3/7/15
device12  1.13 Mbps     24.89 Kbps  21.76 Kbps  461 bps     SiteA       3/8/15
device1   698.44 Kbps   37.71 Kbps  17.49 Kbps  3.37 Kbps   SiteB       3/7/15
device2   39.08 Kbps    1.14 Mbps   10.88 Kbps  27.06 Kbps  SiteB       3/8/15
device3   123.43 Kbps   176.86 Kbps 8.62 Kbps   3.78 Kbps   SiteB       3/9/15

With my R code I can do the following:

#Get list of file names
filenames<- list.files(pattern = ".csv$")

#This extracts everything up to the underscore to get site name
siteName <- str_extract(string = filenames, "[^_]*")

# Extract date from file names use
date <- str_extract(string = filenames, "\\d{8}" )

With the below R code I can merge all the files but that will be without the added columns of site name and date that I want.

myDF<-do.call("rbind", lapply(filenames, read.table, header=TRUE, sep=","))

I just can't get my head around how to do the extracts for site and date adding and populating the columns to create my ideal dataframe which is the second table above.

The solution that best worked for me was posted below :)

romants · Accepted Answer · 2015-03-08T23:31:22.533

2

The way that immediately comes to my mind is to do cbind while reading information with additional infor and do rbind afterwards. Something similar to this:

 myDF<-do.call("rbind", 
          lapply(filenames, 
                 function(x) cbind(read.table(x, header=TRUE, sep=","), 
                                              "Site Name" = str_extract(string = x, "[^_]*"),
                                              "Date" = as.Date(str_extract(string = x, "\\d{8}"), "%m%d%Y"))))

edited Mar 08 '15 at 23:31

answered Mar 08 '15 at 18:40

romants

3,660
1
21
33

interesting I will play with this there's a lot going in in one line thanks for the reply. – JohnP Mar 08 '15 at 18:55
I tried running the sample code you gave me it just hangs but it's definitely on he right track – JohnP Mar 08 '15 at 19:38
Thanks for your help I got it to work your code was right on just needed to add a parent at the end. – JohnP Mar 08 '15 at 21:19

score 1 · Answer 2 · answered Mar 08 '15 at 22:06

I have done something similar which can be applied here. You can add more fileNames separated by comma. Also Site can be extracted similarly. Let me know if you need more help .

    ##Assuming your csv files are saved in location C:/"
    library(stringr)

    ##List all filenames
    fileNames <- c("hist_03082015.csv","hist_03092015.csv")

    ##Create a empty dataframe to save all output to
    final_df <- NULL

    for (i in fileNames) {
    ##Read CSV
    df <- read.csv(paste("C:/",i,sep=""),header = TRUE,
                 sep = ",",colClasses='character')
    ##Extract date from filename into a column
    df$Date <- gsub("\\D","",i)
    ##Convert string to date
    df$Date <-as.Date(paste(str_sub(df$Date, 1, 2),
              str_sub(df$Date,  3,-5),
              str_sub(df$Date, 5,-1),sep="-"),"%d-%m-%Y")
    ##save all data into 1 dataframe 
    final_df <- rbind(final_df,df)
    print(summary(final_df))
    }

Regex and file processing

2 Answers2