
What I have: ~100 txt files, each with 9 columns and >100,000 rows.

What I want: A combined file with only 2 of the columns but all the rows, which should then be transposed to give an output of >100,000 columns and 2 rows.

I've created the function below to go systematically through the files in a folder, pull the data I want, and then, after each file, join it with the original template.

Problem: This works fine on my small test files, but when I try it on large files, I run into a memory allocation issue. My 8 GB of RAM just isn't enough, and I assume that is partly down to how I wrote my code.

My Question: Is there a way to loop through the files and then join them all at once at the end, to save processing time?

Also, if this is the wrong place to put this kind of thing, what is a better forum to get input on WIP code?

## Script to pull in genotype txt files, transpose them, delete commented rows
## & header rows, and then put the files together.

library(plyr)

## Define function
Process_Combine_Genotype_Files <- function(
        inputdirectory = "Rdocs/test", outputdirectory = "Rdocs/test", 
        template = "Rdocs/test/template.txt",
        filetype = ".txt", vars = ""
        ){

## List the files in the directory & put together their path
        filenames <- list.files(path = inputdirectory, pattern = "*.txt")
        path <- paste(inputdirectory,filenames, sep="/")


        combined_data <- read.table(template,header=TRUE, sep="\t")

## for-loop: for every file in directory, do the following
        for (file in path){

## Read genotype txt file as a data.frame
                ## Extract just the file name from its path
                currentfilename <- basename(file)

                data  <- read.table(file, header=TRUE, sep="\t", fill=TRUE)

                #subset just the first two columns (Probe ID & Call Codes)
                #will need to modify this for Genotype calls....
                data.calls  <- data[,1:2]

                #Change column names & row names
                colnames(data.calls)  <- c("Probe.ID", currentfilename)
                row.names(data.calls) <- data[,1]


## Join file to previous data.frame
                combined_data <- join(combined_data,data.calls,type="full")


## End for loop
        }
## Transpose the combined data & write it out
        combined_transposed_data <- t(combined_data)
        print(combined_transposed_data[-1,-1])
        outputfile <- paste(outputdirectory, "Genotypes_combined.txt", sep="/")
        write.table(combined_transposed_data[-1,-1], outputfile, sep="\t")

## End function
}

Thanks in advance.

  • This can be easily handled with [fread](http://www.inside-r.org/packages/cran/data.table/docs/fread) and the `data.table` package. – user227710 Jun 25 '15 at 00:30

2 Answers


Try:

filenames <- list.files(path = inputdirectory, pattern = "\\.txt$", full.names = TRUE)
require(data.table)
data_list <- lapply(filenames, fread, select = c(columns you want to keep))

Now you have a list of all your data. Assuming all the txt files have the same column structure, you can combine them via:

data <- rbindlist(data_list)

transposing data:

t(data)

(Thanks to @Jacob H for `select` in `fread`.)
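
Putting the pieces together, a rough end-to-end sketch of this approach (the directory, output path, and the two column names are assumptions; substitute your own):

library(data.table)

## List the genotype files with their full paths
filenames <- list.files("Rdocs/test", pattern = "\\.txt$", full.names = TRUE)

## Read only the two columns of interest from each file
data_list <- lapply(filenames, fread, select = c("Probe Set ID", "Call Codes"))

## Stack the files (assumes identical column structure), then transpose
combined   <- rbindlist(data_list)
combined_t <- t(combined)

## Write the transposed result as a tab-delimited file
write.table(combined_t, "Rdocs/test/Genotypes_combined.txt", sep = "\t")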

– Rentrop
    I would suggest that you remove the extra columns when reading in the data to take up less space: data_list <- lapply(filenames, function(.file){ fread(.file)[, c(columns needed), with = FALSE] } – Retired Data Munger Jun 25 '15 at 19:27
  • This worked well for me. Unfortunately, my write.table command crashes R Studio, I think because my combined file is SO big. Thank you! – Gaius Augustus Jun 25 '15 at 20:41
  • @DataMunger my file has spaces in its column headers ("Probe Set ID" instead of "Probe_Set_ID"). I haven't had luck importing certain columns because of this. Not sure if there's a way to use your advice when that's the case. Am I missing something? – Gaius Augustus Jun 25 '15 at 20:44
  • Why not remove the header? If you don't remove unnecessary columns upon loading your data, you are going to run out of memory. Also, a more straightforward way to read in select columns is `lapply(filenames, function(x) fread(x, select = c(columns needed)))` – Jacob H Jun 25 '15 at 22:08
  • I was having the issue that when I only loaded the 1 row I needed, the column/row names were disappearing. I assumed that this was because it was no longer a data.frame, so maybe using as.data.frame, I could fix that...? – Gaius Augustus Jun 26 '15 at 18:13
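
Following up on the comments above about headers that contain spaces: fread's select also accepts column positions rather than names, which sidesteps the header-naming issue entirely. A minimal sketch (taking the first two columns, as in the question, is an assumption):

library(data.table)
## Select columns by position so the header spelling doesn't matter
data_list <- lapply(filenames, fread, select = 1:2)
combined  <- rbindlist(data_list)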

If speed/working memory is the concern, then I would recommend using Unix to do the merging. In general, Unix tools are faster than R for this, and they do not require that all the information be loaded into RAM; they read it in chunks, so they are essentially never memory bound. If you don't know Unix but plan to manipulate large files frequently in the future, then learn it: it is simple to learn and very powerful. I will do an example with CSV files.

Generating CSV files in R

for (i in 1:10){
  write.csv(matrix(rpois(1e5*10,1),1e5,10), paste0('test',i,'.csv'))
}

In a shell (Terminal on a Mac or Linux box, Cygwin on Windows):

cut -f 2,3 -d , test1.csv > final.csv   # keep columns 2 and 3 from test1.csv
for f in test[2-9].csv test10.csv; do cut -f 2,3 -d , "$f" | sed 1d >> final.csv; done   # append the same columns from the remaining files, dropping each header row

Note that if you have installed Rtools, you can run all of these Unix commands from R with the `system` function.
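
For instance, a rough sketch of calling the same commands from R (this assumes cut and sed are on your PATH, e.g. via Rtools or Cygwin, and that the command string is handed to a POSIX-style shell, as it is on Mac/Linux):

# Run the same cut/sed pipeline shown above, driven from R
system("cut -f 2,3 -d , test1.csv > final.csv")
system('for f in test[2-9].csv test10.csv; do cut -f 2,3 -d , "$f" | sed 1d >> final.csv; done')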

To transpose, read final.csv back into R and transpose it there.
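
A minimal sketch of that last step (file names follow the example above):

final   <- read.csv("final.csv")
final_t <- t(final)   # transpose: rows become columns
write.csv(final_t, "final_transposed.csv")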

UPDATE:

I timed the above code; it took 0.4 seconds to run, so doing this for 100 files rather than 10 should take roughly 4 seconds. I have not timed the R code. The Unix and R programs may have similar performance when there are only 10 files, but with 100+ files your computer will likely become memory bound and R will likely crash.

– Jacob H
  • You're completely right. I've been working on getting better with this using Cygwin on Windows and the terminal on my Ubuntu boot. I know this is off-topic, but are there any good websites/tutorials to learn these kinds of computing commands? I'm learning R & Python through places like DataCamp and Coursera. Thank you. – Gaius Augustus Jun 25 '15 at 23:15
  • @GaiusAugustus First off, for data wrangling it is not necessary to know a lot of Unix. It is not like learning R, for example. With that in mind, here are some good blog posts: http://bconnelly.net/working-with-csvs-on-the-command-line/ & http://practical-data-science.blogspot.com/2012/09/basic-unix-shell-commands-for-data.html – Jacob H Jun 25 '15 at 23:33