I have a CSV file with 3 columns that looks like this:
a,b,c
1,1,2
1,3,5
1,5,7
.
.
2,3,4
2,1,5
2,4,7
I'd like the output to be like
a,b,c
1,5,7
1,3,5
1,1,2
.
.
2,4,7
2,3,4
2,1,5
i.e., for each element in column a, I'd like to have top 20 (20 highest 'b' values) rows only. Please excuse my poor explanation. I've tried this so far but that doesn't give me my required output:
import csv
import heapq
from itertools import islice
csvout = open ("output.csv", "w")
writer = csv.writer(csvout, delimiter=',',quotechar='"', lineterminator='\n', quoting=csv.QUOTE_MINIMAL)
freqs = {}
with open('input.csv') as fin:
csvin = csv.reader(fin)
rows_with_mut = ([float(row[1])] + row for row in islice(csvin, 1, None) if row[2])
for row in rows_with_mut:
cnt = freqs.setdefault(row[0], [[]] * 20)
heapq.heappushpop(cnt, row)
for assay_id, vals in freqs.iteritems():
output = [row[1:] for row in sorted(filter(None, vals), reverse=True)]
writer.writerows(output)