0

I have a CSV file with 3 columns that looks like this:

a,b,c
1,1,2
1,3,5
1,5,7
.
.
2,3,4
2,1,5
2,4,7

I'd like the output to be like

a,b,c
1,5,7
1,3,5
1,1,2
.
.
2,4,7
2,3,4
2,1,5

i.e., for each element in column a, I'd like to have top 20 (20 highest 'b' values) rows only. Please excuse my poor explanation. I've tried this so far but that doesn't give me my required output:

import csv
import heapq
from itertools import islice
csvout = open ("output.csv", "w")
writer = csv.writer(csvout, delimiter=',',quotechar='"', lineterminator='\n', quoting=csv.QUOTE_MINIMAL)
freqs = {}
with open('input.csv') as fin:
    csvin = csv.reader(fin)
    rows_with_mut = ([float(row[1])] + row for row in islice(csvin, 1, None) if row[2])
    for row in rows_with_mut:
        cnt = freqs.setdefault(row[0], [[]] * 20)
        heapq.heappushpop(cnt, row)

for assay_id, vals in freqs.iteritems():
    output = [row[1:] for row in sorted(filter(None, vals), reverse=True)]
    writer.writerows(output)
abn
  • 1,353
  • 4
  • 28
  • 58
  • so for a=1, you want the top 20 highest values of b, then top top 20 values for a=2 etc ? Is the file sorted? – ventsyv Oct 14 '14 at 23:31
  • Yes. It is sorted by lowest to highest of column a. @ventsyv – abn Oct 14 '14 at 23:36
  • @abn: thus only the first column is ordered, and the rest of the "columns" don't have any specific order (they follow the first column?). Is it required to by python? – Willem Van Onsem Oct 14 '14 at 23:42
  • Yes. Only the firs column is ordered and the rest follow. It need not be in Python, but I'm not familiar with anything other than Python, MySQL, Perl and R. @CommuSoft – abn Oct 14 '14 at 23:43

2 Answers2

2

since the file is only sorted on column a, you have to sort it on column b & c as well. I suggest using natsort, sorting the file in ascending or descending order than looping over it and printing 20 rows for each value of column a.

Something along the lines of:

import natsort

with open('myfile.csv', 'r') as inFile:
    lines = inFile.readlines()
    sortedList = reversed(natsort.natsorted(lines))
    #alternatively, you might want to try natsort.versorted() which is used for version numbers
    counter = 0
    prevAVal=currentAval=1
    for line in sortedList:
        currentAVal = ",".split(line)[0]
        if currentAVal != prevAval:
            counter = 0
        if counter < 20 :
                print line
        counter = counter + 1
        prevAVal=currentAVal
ventsyv
  • 3,316
  • 3
  • 27
  • 49
  • You never increment `counter`? Why don't you use `counter < 20` as stop condition? The script doesn't seem to sortthe values on `b`... – Willem Van Onsem Oct 15 '14 at 00:04
  • Yeah, I don't know what was I thinking... It's fixed now. Natsort will sort alpha-numerically, it even handles leading zeros (01, 011, 010 etc) and will sort those correctly. So while it's not explicitly sorting on column b, it will sort on the whole line. – ventsyv Oct 15 '14 at 00:24
1

On the risk of downvoting, you could use a simple bash script:

#!/bin/bash
all=$(cat) #read from stdin
echo "$all" | head -n 1 #echo the header of the file
allt=$(echo "$all" | tail -n +2) #remove the header from memory
avl=$(echo "$allt" | cut -d ',' -f 1 | sort | uniq) #find all unique values in the a column
for av in $avl #iterate over these values
do
    echo "$allt" | grep "^$av," | sort -t$',' -k2nr | head -n 20 #for each value, find all lines with that value and sort them, return the top 20...
done

You can run this in the command line with:

bash script.sh < data.csv

It will print the result on the terminal...

Example:

If one uses your sample values (without the "dot"-rows), one obtains:

user@machine ~> bash script.sh < data.csv 
a,b,c
1,5,7
1,3,5
1,1,2
2,4,7
2,3,4
2,1,5

If you want to write the result to a file (say data2.csv) use:

bash script.sh < data.csv > data2.csv

Don't read and write to the same file: don't run bash script.sh < data.csv > data.csv.

Willem Van Onsem
  • 443,496
  • 30
  • 428
  • 555