1

I have two tab delimited csv files (with headers) that I need to merge in python.

Also, in the merged file I want to add a column in the end to identify the files because though they have same format, they have different data that I need to separate later on. So, I want to add a column called 'source' on each line of output which is 0 for file1 and 1 for file2.

I have gone far as using the csv module but the writerow adds an additioal newline character between each line it writes and this code doesn't write anything from file2. What am I doing wrong here? Also, how do I add the extra column 'source' in the line object?

import os, csv

path1 = os.path.abspath("../data/file1.txt")
path2 = os.path.abspath("../data/file2.txt")
merged_path = os.path.abspath('../data/output.txt')

# merge the two files for further processing
merged_file = csv.writer(open(merged_path, 'a'), delimiter = '\t')

#file1
fg = csv.reader(open(path1, 'r'), delimiter = '\t')

for line in fg:
    if line[7] != '\N':
        merged_file.writerow(line) 

#file2
bg = csv.reader(open(path2, 'r'), delimiter = '\t')

for line in bg:
    if line[16] != '\N':
        merged_file.writerow(line) 
sfactor
  • 12,592
  • 32
  • 102
  • 152

1 Answers1

2

I prefer to use the dictWriter for this. Also, your code doesn't work because the csv library requires opening files in binary mode.

import os, csv

path1 = os.path.abspath("../data/file1.txt")
path2 = os.path.abspath("../data/file2.txt")
merged_path = os.path.abspath('../data/output.txt')

#file1
fg = csv.DictReader(open(path1, 'rb'), delimiter = '\t')

fieldnames = fg.fieldnames
fieldnames.append('source')
# merge the two files for further processing
merged_file = csv.DictWriter(open(merged_path, 'ab'), delimiter = '\t', fieldnames=fieldnames)
merged_file.writeheader()

for row in fg:
    row['source'] = os.path.basename(path1)
    merged_file.writerow(row)

#file2
bg = csv.DictReader(open(path2, 'rb'), delimiter = '\t')

for row in bg:
    row['source'] = os.path.basename(path1)
    merged_file.writerow(row)
Spencer Rathbun
  • 14,510
  • 6
  • 54
  • 73
  • Thanks ! worked like at charm. This was my first real python code. Better start reading up on its nuances. – sfactor Feb 09 '12 at 14:05
  • The documentation on csv doesn't state that it needs binary mode. It does use it in examples but some Note about it would be nice. – rplnt Feb 09 '12 at 14:11
  • 1
    @rplnt [See here](http://docs.python.org/library/csv.html#csv.reader), the section on `If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.` – Spencer Rathbun Feb 09 '12 at 14:40
  • Binary mode is irrelevant on non-Windows platforms. – Wooble Feb 09 '12 at 14:52
  • @Wooble only because the os opens files as binary. The library still needs it opened in binary, but you don't have to think about that if you are not on Windows. – Spencer Rathbun Feb 09 '12 at 14:54
  • No, it doesn't. It opens them as text files, it's just that *nix systems use sensible line endings instead of Microsoft's abomination. – Wooble Feb 09 '12 at 14:55
  • @SpencerRathbun I have missed that. Thanks. – rplnt Feb 09 '12 at 16:29