0

I am working on R Studio.

I have ~50 txt files, all of which have data in similar format (14 col's in same order). Each file is large (~100MB). I have all the files saved in one folder. What I need to do it write an R code so that I can open one file, perform some operations, write the output to target file and close the file. After this I need to follow the process for the remaining files. I wanted to automate the process, but unsure on how to proceed with the same. Tried searching, but could not come across something that I could apply.

I have done this for one file, but not sure of the code that could help me automate the process for 50 files. Requesting help.....

library(data.table)
library(dplyr)
library(tidyverse)

#### Opening One File
myfile=fread("C:/Users/shegu/Desktop/LOB30SCRIP_010318.txt", 
sep="|",header=FALSE, stringsAsFactors = TRUE)

#### Renaming cols
colnames(myfile) <- c("Trading_Session", "Scrip_Code", "Buy_Sell", 
        "Order_Type", "Rate_in_Paise", "Quantity","Avl_Quantity", "Order_Time_Stamp", 
        "Retention", "AUD_Code", "Order_ID", "Action_ID", "Error_Code","ALGO_Flag")

#### Changing Format of cols

myfile$Order_Time_Stamp=as.Date(myfile$Order_Time_Stamp, "%Y-%m-%d %H:%M:%S")
myfile$Scrip_Code=as.factor(myfile$Scrip_Code)
myfile$Order_ID=as.factor(myfile$Order_ID)

#### Performing Group-by operation (this needs to be done on each file in my folder)

myfile_by_AUD_Code=myfile%>%
    group_by(Scrip_Code,ALGO_Flag,AUD_Code)%>%
    summarise(n())

#### Writing results to target file (need results for all files in this target file)
       write.csv(myfile_by_AUD_Code,"C:/Users/shegu/Desktop/BSE_Data/Target.csv", 
       row.names = FALSE)

Phil
  • 7,287
  • 3
  • 36
  • 66
  • Approach this methodically. Generate a list of all the files, and perform the action for all of the files in this list. Save the output to a target object, then write that object to file. – mhovd Mar 30 '20 at 10:28
  • Thank you mhh. In fact I an unable to come-up with the code to open files one by one and perform the required operations. Further I would also remove the file from r as each file size is quite big. Appreciate the help,. – Abhishek G Mar 30 '20 at 10:54

1 Answers1

0

You have most of the steps. You just need to generalize them with a function applied over all files in a directory. I propose you the following. By the way, you load too many packages: since you use fread keep a data.table approach.

library(data.table)

files <- list.files("C:/Users/shegu/Desktop/",
                    full.names = TRUE, pattern = ".csv")


clean_file <- function(filename){

  data <- fread(filename, sep="|",header=FALSE, stringsAsFactors = TRUE)

  #### Renaming cols
  data.table::setnames(data, new = c("Trading_Session", "Scrip_Code", "Buy_Sell", 
                               "Order_Type", "Rate_in_Paise", "Quantity","Avl_Quantity", "Order_Time_Stamp", 
                               "Retention", "AUD_Code", "Order_ID", "Action_ID", "Error_Code","ALGO_Flag"),
                       old = colnames(data)
  )

  data[, Order_Time_Stamp := as.Date(Order_Time_Stamp, "%Y-%m-%d %H:%M:%S")]
  data[, Scrip_Code := as.factor(Scrip_Code)]
  data[, Order_ID := as.factor(Order_ID)]

  #### Performing Group-by operation (this needs to be done on each file in my folder)
  myfile_by_AUD_Code <- data[, .(nobs = .N) ,by = c("Scrip_Code","ALGO_Flag","AUD_Code")]

  ### Write file
  fwrite(myfile_by_AUD_Code, gsub(".csv","_summary.csv",filename))

  return(NULL)  
}

lapply(files,clean_file)

filenames are detected using list.files function. I add a pattern argument to ensure only csv files are listed. Add other elements if there are files in your directory you don't want to read.

I proposed you to write the files as old_filename_summary.csv Change that line if it does not suit you.

The rest is just a traduction in data.table of your example

linog
  • 5,786
  • 3
  • 14
  • 28
  • Thank you for the detailed explanation linog. However, when I an running the code I am getting the following error: Error in data.table::setnames(data, new = c("Trading_Session", "Scrip_Code", : 'old' is length 1 but 'new' is length 14 . Tried resolving by looking into problems with setnames, but no luck. Would be great if you could assist on the same..Also If you could help me understand when to put the filepath for the target files to be saved. Thank you again for taking time for this – Abhishek G Mar 31 '20 at 04:29
  • It means that some files do not have 14 columns but just one: they are not using `"|"` as separator. Either all your files do not have the same structure or you're importing files you don't want to. Are you sure all `csv` in your directory need should be imported by `R`? I would start by looking at the values within `file` vector and ensure there are only files interesting you. Maybe creating a subdirectory containing only the files that interest you could be of help (then change the directory `list.files` point to) and filter `csv` that are not needed. – linog Mar 31 '20 at 05:57
  • Thanks a lot linog, it worked like a charm. Indeed one of the files in my repository was having different number of coloums. – Abhishek G Mar 31 '20 at 07:44