
I am trying to rbind a large number of csv files that have many overlapping columns.

library(plyr)

filenames <- list.files("output/", pattern = "\\.csv$")  # pattern is a regex, not a glob
filenames <- paste("output/", filenames, sep = "")
frames <- lapply(filenames, read.csv, header = TRUE)
input <- do.call(rbind.fill, frames)
write.csv(input, file="stacked.csv", quote=FALSE, na="", row.names=FALSE)

The combined frame will have around 300,000 rows, the union of the columns is 3,000 or so, and there are 104 files. Doing this, my computer blows past its memory limit (32 GB) and R crashes. I've also tried splitting the process up to deal with segmented memory. Again, no luck:

library(plyr)

filenames <- list.files("output/", pattern = "\\.csv$")
filenames <- paste("output/", filenames, sep = "")
input <- lapply(filenames, read.csv, header = TRUE)
part1 <- do.call(rbind.fill, input[1:30])
part2 <- do.call(rbind.fill, input[31:70])
part3 <- do.call(rbind.fill, input[71:104])
save(part1, part2, part3, file = "p1")  # save the parts so they can be reloaded later

stacked <- rbind.fill(part1, part2, part3)
write.table(stacked, file = "stacked.csv", sep = ",", quote = FALSE, na = "", row.names = FALSE)

The above is just a sample of what I did. Once I load part1, part2, and part3 back into memory they are ~6 GB total. Before saving and loading, my memory usage is around 20 GB. I then try rbind.fill(part1, part2) and the memory usage spikes past 32 GB again.

Any suggestions on how to approach this? I'm open to other solutions (python, sql, etc).

Rob Richmond
  • Do you know the complete list of column names and classes in advance? – mnel Nov 16 '12 at 00:10
  • My first instinct would be to try Python, since you don't have to read the files into memory to operate on them. You could iterate once through the files to build up the list of column names, then go through them again to actually read and concatenate the data. – Marius Nov 16 '12 at 00:19

2 Answers


An algorithm that uses very little memory by trading it for disk would be (sketched in R below):

  1. Read the headers of all the files to find the unique set of columns.
  2. Process each file line by line: add NAs for the missing columns and write that line to one large file.
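
In R, a rough version of those two passes could look like this (a sketch, untested at this scale; it assumes plain comma-separated files with no quoted fields, and it holds one whole file in memory at a time rather than a single line):

files <- paste("output/", list.files("output/", pattern = "\\.csv$"), sep = "")

# Pass 1: read only the headers to build the union of column names.
all_fields <- unique(unlist(lapply(files, function(f) names(read.csv(f, nrows = 1)))))

# Pass 2: pad each file out to the full column set and append it to disk.
out <- "stacked.csv"
writeLines(paste(all_fields, collapse = ","), out)  # write the combined header once
for (f in files) {
    d <- read.csv(f)                        # one file in memory at a time
    missing <- setdiff(all_fields, names(d))
    if (length(missing)) d[missing] <- NA   # add the columns this file lacks
    write.table(d[all_fields], out, sep = ",", quote = FALSE, na = "",
                row.names = FALSE, col.names = FALSE, append = TRUE)
}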

Only when you are done do you read that large file into your R session (if it is not too large).
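
For example (a minimal sketch; colClasses = "numeric" is an assumption that every column is numeric, which spares read.csv a type-guessing pass over 300,000 rows, and na.strings = "" matches the empty fields written for missing values):

input <- read.csv("stacked.csv", header = TRUE, na.strings = "", colClasses = "numeric")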

Other languages might be more appropriate for this kind of task. Perl comes to mind.

Edit: If you are interested, here is the code using Perl. Put it in a rbindfill.pl file and run it as follows: perl rbindfill.pl > output.csv

use strict;
use warnings;

my @files = glob "output/*.csv";
my %fields;

# Pass 1: collect the union of all column names from the headers.
foreach my $file (@files)
   {
   open my $fh, '<', $file or die "Cannot open $file: $!";
   my $header = <$fh>;
   chomp $header;
   $fields{$_} = 1 for split ',', $header;
   close $fh;
   }

# Note: keys returns the names in arbitrary order, so the original
# column order is not preserved.
my @all_fields = keys %fields;
print join(',', @all_fields) . "\n";

# Pass 2: re-read each file one line at a time, fill in the missing
# columns with empty strings, and print the record in the combined order.
foreach my $file (@files)
   {
   open my $fh, '<', $file or die "Cannot open $file: $!";
   my $header = <$fh>;
   chomp $header;
   my @fields = split ',', $header;
   while (my $line = <$fh>)
      {
      chomp $line;
      my $rec;
      @{$rec}{@fields} = split ',', $line;
      print join(',', map { defined $rec->{$_} ? $rec->{$_} : '' } @all_fields) . "\n";
      }
   close $fh;
   }
flodel
  • Thanks, I'll throw something together in Python as I'm more literate with that. Should be easy enough using http://docs.python.org/2/library/csv.html#csv.DictWriter. I'll post the code back here once I get a few minutes to put it together. – Rob Richmond Nov 16 '12 at 01:19
  • Thanks for the Perl code. Just for kicks I wrote a python script to do this which I posted below. – Rob Richmond Nov 16 '12 at 08:52
  • The original column order is not retained in the output of this script... How can I achieve this? – mixedbag99 Nov 14 '17 at 06:43

Here is the Python code that I used. It also adds a column for the filename, drops anything that can't be converted to a float (text fields in particular), and skips one data line (two lines including the header) when writing to the output file.

import csv
import glob

files = glob.glob("data/*.txt")
csv.field_size_limit(1000000000)

outfile = "output.csv"

def clean_row(row, filename):
    # Coerce every value to a float; anything that fails the conversion
    # (text fields in particular) is blanked out.
    for k, v in row.items():
        try:
            row[k] = float(v)
        except (TypeError, ValueError):
            row[k] = ""
    row['yearqtr'] = filename  # tag each row with its source file
    return row

# Pass 1: collect the union of all column names.
# (A set does not preserve the original column order.)
headers = set()
for filename in files:
    with open(filename, "r") as infile:
        reader = csv.reader(infile)
        headers.update(next(reader))

headers = list(headers)
headers.insert(0, "yearqtr")

# Pass 2: stream each file through DictReader into one DictWriter;
# restval fills in missing columns, extrasaction ignores extra ones.
with open(outfile, "w") as out:
    writer = csv.DictWriter(out, headers, restval="", extrasaction="ignore")
    writer.writeheader()
    for filename in files:
        with open(filename, "r") as infile:
            reader = csv.DictReader(infile)
            next(reader)  # deliberately skip the first data line
            writer.writerows(clean_row(row, filename) for row in reader)
Rob Richmond