Combine columns from several CSV files into a single file

Question

I have a bunch of CSV files (only two in the example below). Each CSV file has 6 columns. I want to go into each CSV file, copy the first two columns and add them as new columns to an existing CSV file.

Thus far I have:

import csv

f = open('combined.csv')
data = [item for item in csv.reader(f)]
f.close()

for x in range(1,3): #example has 2 csv files, this will be automated
    n=0
    while n<2:
        f=open(str(x)+".csv")
        new_column=[item[n] for item in csv.reader(f)]
        f.close()
        #print d

        new_data = []

        for i, item in enumerate(data):
            try:
                item.append(new_column[i])
                print i
            except IndexError, e:
                item.append("")
            new_data.append(item)

        f = open('combined.csv', 'w')
        csv.writer(f).writerows(new_data)
        f.close()
        n=n+1

This works, it is not pretty, but it works. However, I have three minor annoyances:

I open each CSV file twice (once for each column), that is hardly elegant
When I print the combined.csv file, it prints an empty row following each row?
I have to provide a combined.csv file that has at least as many rows in it as the largest file I may have. Since I do not really know what that number may be, that kinda sucks

As always, any help is much appreciated!!

As requested: 1.csv looks like (mock data)

1,a
2,b
3,c
4,d

2.csv looks like

5,e
6,f
7,g
8,h
9,i

the combined.csv file should look like

1,a,5,e
2,b,6,f
3,c,7,g
4,d,8,h
,,9,i

Could you provide (even if a mocked) example of inputs/outputs... — Jon Clements, Jan 25 '13 at 21:41
Thank you for updating your question - good job +1 from me. How large are these files? — Jon Clements, Jan 25 '13 at 21:48
They are small. Each file has 6 columns with less than 30 rows in each column. I have about 200 of these files — Stylize, Jan 25 '13 at 21:49

score 8 · Accepted Answer · edited Sep 01 '21 at 11:12

import csv
import itertools as IT

filenames = ['1.csv', '2.csv']
handles = [open(filename, 'rb') for filename in filenames]    
readers = [csv.reader(f, delimiter=',') for f in handles]
    
with  open('combined.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n', )
    for rows in IT.izip_longest(*readers, fillvalue=['']*2):
        combined_row = []
        for row in rows:
            row = row[:2] # select the columns you want
            if len(row) == 2:
                combined_row.extend(row)
            else:
                combined_row.extend(['']*2)#This extends two empty columns
        writer.writerow(combined_row)
        
for f in handles:
    f.close()

The line for rows in IT.izip_longest(*readers, fillvalue=['']*2): can be understood with an example:

In [1]: import itertools as IT

In [2]: readers = [(1,2,3), ('a','b','c','d'), (10,20,30,40)]

In [3]: list(IT.izip_longest(readers[0], readers[1], readers[2]))
Out[3]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

As you can see, IT.izip_longest behaves very much like zip, except that it does not stop until the longest iterable is consumed. It fills in missing items with None by default.

Now what happens if there were more than 3 items in readers? We would want to write

list(IT.izip_longest(readers[0], readers[1], readers[2], ...))

but that's laborious and if we did not know len(readers) in advance, we wouldn't even be able to replace the ellipsis (...) with something explicit.

Python has a solution for this: the star (aka argument unpacking) syntax:

In [4]: list(IT.izip_longest(*readers))
Out[4]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

Notice the result Out[4] is identical to the result Out[3].

The *readers tells Python to unpack the items in readers and send them along as individual arguments to IT.izip_longest. This is how Python allows us to send an arbitrary number of arguments to a function.

thanks for the solution. I accepted yours because I understand the code best. However, would you mind explaining what "for rows in IT.izip_longest(*readers, fillvalue=['']*2):" does? Especially, the "*readers" is confusing to me... — Stylize, Jan 26 '13 at 01:49

DSM · Answer 2 · 2013-01-26T00:06:33.727

These days it seems almost obligatory for someone to give a pandas-based solution to any data processing problem in Python. So here's mine:

import pandas as pd

to_merge = ['{}.csv'.format(i) for i in range(4)]
dfs = []
for filename in to_merge:
    # read the csv, making sure the first two columns are str
    df = pd.read_csv(filename, header=None, converters={0: str, 1: str})
    # throw away all but the first two columns
    df = df.ix[:,:1]
    # change the column names so they won't collide during concatenation
    df.columns = [filename + str(cname) for cname in df.columns]
    dfs.append(df)

# concatenate them horizontally
merged = pd.concat(dfs,axis=1)
# write it out
merged.to_csv("merged.csv", header=None, index=None)

which for the files

~/coding/pand/merge$ cat 0.csv 
0,a,6,5,3,7
~/coding/pand/merge$ cat 1.csv 
1,b,7,6,7,0
2,c,0,1,8,7
3,d,6,8,4,5
4,e,8,4,2,4
~/coding/pand/merge$ cat 2.csv 
5,f,6,2,9,1
6,g,0,3,2,7
7,h,6,5,1,9
~/coding/pand/merge$ cat 3.csv 
8,i,9,1,7,1
9,j,0,9,3,9

gives

In [21]: !cat merged.csv
0,a,1,b,5,f,8,i
,,2,c,6,g,9,j
,,3,d,7,h,,
,,4,e,,,,

In [22]: pd.read_csv("merged.csv", header=None)
Out[22]: 
    0    1  2  3   4    5   6    7
0   0    a  1  b   5    f   8    i
1 NaN  NaN  2  c   6    g   9    j
2 NaN  NaN  3  d   7    h NaN  NaN
3 NaN  NaN  4  e NaN  NaN NaN  NaN

which I think is the right alignment.

steveha · Answer 3 · 2013-01-25T23:07:25.670

Here is a program I wrote to solve your problem. It makes a class that holds the information about each CSV file to read, including which columns you want from it. Then there is simply a list of CSV files to read, and a line is read from each.

Since you said it needs to keep returning rows until all the input files are read, it returns dummy values for input files that have reached the end. It keeps reading rows until all input files are done.

Also, this program only needs to hold one row at a time in memory. So it could process even large CSV files without needing much memory.

Originally I had a dummy value of -1 for missing data. Now I see you added an example and you just want no value. I've changed the program from using -1 to using an empty string when there is no data.

One of the design goals was to make this extendable. Right now you need the first two columns, but what if you later need columns 0, 3, and 7 from one of the files? So each file has a list with the columns to take.

I didn't actually write the code to rename the output file to the original filename but that is easy to add.

Ideally this whole thing would be wrapped up into a class, where you can iterate a class instance and get back one row put together using columns from all input files. I didn't take the extra time to do that, but if you will be using this over the long term you might want to do that. Also, I never bothered to close any of the input files, since I figure the program will end after we write the output file and everything will close then; but ideally we should close all files after we use them!

import csv

fname_in = "combined.csv"
fname_out = "combined.tmp"

lst_other_fnames = [str(x) + ".csv" for x in range(1, 3)]

no_data = ''
def _no_data_list(columns):
    return [no_data for _ in columns]

class DataCsvFile(object):
    def __init__(self, fname, columns=None):
        self.fname = fname
        self.f = open(fname)
        self.reader = csv.reader(self.f)
        self.columns = columns
        self.done = False
    def next_columns(self):
        if self.done:
            return _no_data_list(self.columns)

        try:
            item = next(self.reader)
        except StopIteration:
            self.done = True
            return _no_data_list(self.columns)

        return [item[i] for i in self.columns]

# want all columns from original file
data_csv_files = [DataCsvFile(fname_in, range(5))]

# build list of filenames and columns: want first two columns from each
data_csv_files.extend(DataCsvFile(fname, range(2)) for fname in lst_other_fnames)


with open(fname_out, "w") as out_f:
    writer = csv.writer(out_f)

    while True:
        values = []
        for df in data_csv_files:
            columns = df.next_columns()
            values.extend(columns)
        if not all(df.done for df in data_csv_files):
            writer.writerow(values)
        else:
            break

georg · Answer 4 · 2013-01-25T22:46:09.390

Here's an example (I'm using string io instead of files for simplicity, but that's not essential):

a = u"""
1,a
2,b
3,c
4,d
"""
b = u"""
5,e
6,f
7,g
8,h
9,i
"""
c = u"""
11,x
12,y
13,z
"""

import io, csv, itertools

data = []
expand = lambda it, size: it + [[''] * len(it[0])] * size

for f in [a, b, c]:
    with io.StringIO(f.strip()) as fp:
        d = list(csv.reader(fp))
        t = len(d) - len(data)
        data = d if not data else [
            x + y for x, y in itertools.izip_longest(
                expand(data, t), expand(d, -t))]

for r in data:
    print ','.join(r)    

# 1,a,5,e,11,x
# 2,b,6,f,12,y
# 3,c,7,g,13,z
# 4,d,8,h,,
# ,,9,i,,

with real files (named 1.csv, 2.csv etc) the main loop will look like this:

for n in range(...):
    with open(str(n) + '.csv') as fp:
        d = list(csv.reader(fp))
        t = len(d) - len(data)
        data = d if not data else [
            x + y for x, y in itertools.izip_longest(
                expand(data, t), expand(d, -t))]

nothing like "but that's not essential" to make someone feel stupid :). Would you mind editing for use with files? — Stylize, Jan 25 '13 at 22:21

Combine columns from several CSV files into a single file

4 Answers4

Linked

Related