My INPUT file:

1,boss,30
2,go,35
2,nan,45
3,fog,33
4,kd,55
4,gh,56

Output file should be:

1,boss,30
3,fog,33

In other words, the output file should not contain any record whose column 1 value occurs more than once. Every record with a repeated value in column 1 must be deleted, including its first occurrence.

Code I tried:

source_rd = csv.writer(open("Non_duplicate_source.csv", "wb"),delimiter=d)
gok = set()
for rowdups in sort_src:
    if rowdups[0] not in gok:
        source_rd.writerow(rowdups)
        gok.add( rowdups[0])

Output I got:

1,boss,30
2,go,35
3,fog,33
4,kd,55

What am I doing wrong?

Tim Pietzcker
Gokul Krishna
  • For starters, take a look at [How do I format my posts using Markdown or HTML?](http://stackoverflow.com/help/formatting); I'll edit it for you this time so you can see how it works. – Tim Pietzcker Sep 23 '14 at 14:14
  • What is `sort_src`? Also, could you clarify why you didn't expect that output; the duplicates have been removed as required. – jonrsharpe Sep 23 '14 at 14:21

1 Answer

You can simply loop over the file twice.

On the first pass, count how many times each column 1 value occurs. On the second pass, output only the rows whose column 1 value occurred exactly once.

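import csv

fn = "input.csv"  # placeholder: path to your input file

# First pass: count how often each column 1 value occurs.
gok = {}
with open(fn) as fin:
    reader = csv.reader(fin)
    for e in reader:
        gok[e[0]] = gok.get(e[0], 0) + 1

# Second pass: print only the rows whose column 1 value occurs once.
with open(fn) as fin:
    reader = csv.reader(fin)
    for e in reader:
        if gok[e[0]] == 1:
            print(e)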

Prints:

['1', 'boss', '30']
['3', 'fog', '33']
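
If you want the result written to a CSV file rather than printed, the same count-then-filter idea could be sketched like this for Python 3. This is only an illustration: the names `drop_repeated_keys` and `dedupe_csv` are made up for this example, the whole file is assumed to fit in memory, and blank lines are skipped.

```python
import csv
from collections import Counter


def drop_repeated_keys(rows, key_index=0):
    """Keep only the rows whose key column value occurs exactly once."""
    rows = [row for row in rows if row]  # skip blank lines
    counts = Counter(row[key_index] for row in rows)
    return [row for row in rows if counts[row[key_index]] == 1]


def dedupe_csv(src_path, dst_path, delimiter=","):
    """Read src_path, drop rows with a repeated first column, write dst_path."""
    with open(src_path, newline="") as fin:
        kept = drop_repeated_keys(csv.reader(fin, delimiter=delimiter))
    with open(dst_path, "w", newline="") as fout:
        csv.writer(fout, delimiter=delimiter).writerows(kept)


# The sample data from the question, already split into fields.
sample = [
    ["1", "boss", "30"],
    ["2", "go", "35"],
    ["2", "nan", "45"],
    ["3", "fog", "33"],
    ["4", "kd", "55"],
    ["4", "gh", "56"],
]
unique_rows = drop_repeated_keys(sample)
print(unique_rows)  # [['1', 'boss', '30'], ['3', 'fog', '33']]
```

Calling `dedupe_csv("input.csv", "Non_duplicate_source.csv")` would then produce the output file; `input.csv` is a placeholder for your own path.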

Your method does not work because, by the time the second instance of a duplicate is seen, the first instance has already been written to the output and cannot be taken back.
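
For comparison, a single pass can also work, as long as nothing is written until all rows have been seen. The sketch below (Python 3.7+, where dicts keep insertion order) holds candidate rows in memory and discards a key as soon as it shows up a second time. The function name `unique_by_first_column` is hypothetical, and rows are assumed to be non-empty lists.

```python
def unique_by_first_column(rows):
    """Return the rows whose first field never repeats, in input order."""
    first_seen = {}   # key -> the only row seen so far with that key
    repeated = set()  # keys that have occurred more than once
    for row in rows:
        key = row[0]
        if key in repeated:
            continue  # third or later occurrence: already discarded
        if key in first_seen:
            # Second occurrence: drop the row that was held back.
            del first_seen[key]
            repeated.add(key)
        else:
            first_seen[key] = row
    return list(first_seen.values())


rows = [
    ["1", "boss", "30"],
    ["2", "go", "35"],
    ["2", "nan", "45"],
    ["3", "fog", "33"],
    ["4", "kd", "55"],
    ["4", "gh", "56"],
]
print(unique_by_first_column(rows))  # [['1', 'boss', '30'], ['3', 'fog', '33']]
```

The trade-off is memory: every row that is still a candidate has to be kept until the input is exhausted, whereas the two-pass version only stores the counts.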

dawg