python manipulating files, logic error in script?

Question

I have a file that looks like this:

Chr-coordinate-coverage

chr1    236968289   2
chr1    236968318   2
chr1    236968320   2
chr1    236968374   2
chr1    237005709   2
chr14   22086843    2
chr14   22086846    2
chr14   22086849    2
chr14   22086851    4
chr2    5078129 2
chr2    5341758 2
chr2    5342443 2

I want to manipulate it to obtain:

chr-start-end-average coverage-distance

chr1  236968289  236968374 2    85    
chr14 22086843   22086851  2.5  8
chr2  5078129    5078129   2    0
chr2  5341758    5342443   2    685

I want that: if chr is different from the previous chr or the difference between coordinates is bigger then 1000: it prints the output as shown. With the chr, the starting coordinate, the ending coordinate, the average coverage and the distance between start and end.

To do so, I wrote the following code:

cov=open("coverage.txt")
oldchr="chr55"      #dummy starting data
oldcoordinate=1
sumcoverage=0
startcoordinate=0

try:
   while True:
    line=next(cov).split("\t",2)
    newchr=line[0]
    newcoordinate=int(line[1])       #read informations from file
    newcoverage=int(line[2].strip())

    if oldchr != newchr or newcoordinate - oldcoordinate > 1000:
        distance=oldcoordinate-startcoordinate
        averagecoverage=sumcoverage/distance
        merge=oldchr+'\t'+str(startcoordinate)+'\t'+str(oldcoordinate)+'\t'+str(averagecoverage)+'\t'+str(distance) 
        print merge
        startcoordinate=newcoordinate
        sumcoverage=0

    oldchr=newchr
    oldcoordinate=newcoordinate     #replace old with new chr and coordinates
    sumcoverage=sumcoverage+newcoverage


except(StopIteration):
    print ""

I am not able to understand why it doesn't work properly. The error I got is that the division to obtain the "average coverage" is trying to divide per 0, so in many cases the "distance" ( distance=oldcoordinate-startcoordinate) is equal to 0. This should not happen, in the input file is never the case that 2 lines have the same coordinate. I am not able to see where the error is. I hope someone can help me, thank you in advance.

Check that each time you are correctly reading the next line. And check that `newcoordinate` is updated correctly. — , Feb 11 '16 at 13:56
I checked, line and new coordinate are read correctly from the file at the beginning of every clycle — Elisa Rosati, Feb 11 '16 at 14:02
Inside the if you call `startcoordinate=newcoordinate` then outside the if you call `oldcoordinate=newcoordinate`. So `startcoordinate == oldcoordinate`. In the next iteration you are only updating `newcoordinate` but the distance is defined as `distance=oldcoordinate-startcoordinate` which will be 0. — , Feb 11 '16 at 14:07
I added an if distance== 0 : act differently. It is better now, still not exactly the result I want, but better, I guess there is something else wrong in the logic — Elisa Rosati, Feb 11 '16 at 14:19
The value `237005709` seems to have vanished from your expected output? — Martin Evans, Feb 11 '16 at 14:31

Martin Evans · Answer 1 · 2016-02-11T14:44:38.160

You could use Python's groupby function to group up your entries according to the chr column. The csv library also makes it easier to process the file:

from itertools import groupby
import csv

def display_block(block):
    average_coverage = sum(x[2] for x in block) / float(len(block))
    print block[0][0], "\t", block[0][1], "\t", block[-1][1], "\t", average_coverage, "\t", block[-1][1] - block[0][1]


with open('coverage.txt', 'rb') as f_coverage:
    for chr, entries in groupby(csv.reader(f_coverage, delimiter='\t'), lambda x: x[0]):
        entries = [(e[0], int(e[1]) , int(e[2])) for e in entries]
        block = []
        ientries = iter(entries)
        block.append(next(ientries))

        for chr, coord, coverage in ientries:
            if coord - block[-1][1] > 1000:
                display_block(block)
                block = []
            block.append([chr, coord, coverage])

        if len(block):
            display_block(block)

This would display the following output:

chr1    236968289   236968374   2.0     85
chr1    237005709   237005709   2.0     0
chr14   22086843    22086851    2.5     8
chr2    5078129     5078129     2.0     0
chr2    5341758     5342443     2.0     685

Each iteration of the for loop gives you a current chr and all matching rows for that chr. The script goes through each row and converts the coord and coverage to integers. It then makes the list it into an iterator. The first matching row is stored in block and then any remaining rows are iterated over. Each time the distance is greater than 1000 the block is displayed and restarted. Any remaining entries at the end are then also displayed.

Tested using Python 2.7.6

python manipulating files, logic error in script?

1 Answers1