I have a file that looks like this:
Chr-coordinate-coverage
chr1 236968289 2
chr1 236968318 2
chr1 236968320 2
chr1 236968374 2
chr1 237005709 2
chr14 22086843 2
chr14 22086846 2
chr14 22086849 2
chr14 22086851 4
chr2 5078129 2
chr2 5341758 2
chr2 5342443 2
I want to manipulate it to obtain:
chr-start-end-average coverage-distance
chr1 236968289 236968374 2 85
chr14 22086843 22086851 2.5 8
chr2 5078129 5078129 2 0
chr2 5341758 5342443 2 685
The rule I want is: whenever chr differs from the chr on the previous line, or the difference between consecutive coordinates is bigger than 1000, the script prints one output line as shown above. That line contains the chr, the starting coordinate, the ending coordinate, the average coverage, and the distance between start and end.
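To make the intended rule concrete, here is a minimal sketch of the grouping. It is not the script I am asking about. The names `merge_regions`, `_close` and `max_gap` are made up for illustration, and it assumes that "average coverage" means the sum of the coverage values divided by the number of lines in the group (which matches the 2.5 shown for chr14).

```python
def _close(region):
    """Turn the running state of a group into an output tuple."""
    chrom, start, end, coverage_sum, line_count = region
    # Assumption: average = coverage sum / number of lines in the group.
    average = coverage_sum / float(line_count)
    return (chrom, start, end, average, end - start)


def merge_regions(rows, max_gap=1000):
    """Group (chr, coordinate, coverage) rows into regions.

    A new region starts when chr changes or when the gap to the
    previous coordinate is bigger than max_gap.
    """
    regions = []
    current = None  # [chr, start, end, coverage_sum, line_count]
    for chrom, coord, coverage in rows:
        same_region = (
            current is not None
            and chrom == current[0]
            and coord - current[2] <= max_gap
        )
        if same_region:
            current[2] = coord        # extend the region to this line
            current[3] += coverage
            current[4] += 1
        else:
            if current is not None:
                regions.append(_close(current))
            current = [chrom, coord, coord, coverage, 1]
    if current is not None:
        regions.append(_close(current))  # do not forget the last region
    return regions


sample = [
    ("chr14", 22086843, 2),
    ("chr14", 22086846, 2),
    ("chr14", 22086849, 2),
    ("chr14", 22086851, 4),
    ("chr2", 5078129, 2),
    ("chr2", 5341758, 2),
    ("chr2", 5342443, 2),
]
for region in merge_regions(sample):
    print("\t".join(str(field) for field in region))
```

Applied literally, this rule would also make the sample line `chr1 237005709 2` a region of its own, because it is more than 1000 away from the previous chr1 coordinate.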
To do so, I wrote the following code:
cov=open("coverage.txt")
oldchr="chr55" #dummy starting data
oldcoordinate=1
sumcoverage=0
startcoordinate=0
try:
    while True:
        line=next(cov).split("\t",2)
        newchr=line[0]
        newcoordinate=int(line[1]) #read the fields from the file
        newcoverage=int(line[2].strip())
        if oldchr != newchr or newcoordinate - oldcoordinate > 1000:
            distance=oldcoordinate-startcoordinate
            averagecoverage=sumcoverage/distance
            merge=oldchr+'\t'+str(startcoordinate)+'\t'+str(oldcoordinate)+'\t'+str(averagecoverage)+'\t'+str(distance)
            print merge
            startcoordinate=newcoordinate
            sumcoverage=0
        oldchr=newchr
        oldcoordinate=newcoordinate #replace old with new chr and coordinates
        sumcoverage=sumcoverage+newcoverage
except(StopIteration):
    print ""
I cannot understand why it doesn't work properly. The error I get is a division by zero when computing the average coverage, which means that in many cases the distance (distance=oldcoordinate-startcoordinate) is equal to 0. I did not expect this, because the input file never contains two lines with the same coordinate. I cannot see where the error is. I hope someone can help me, thank you in advance.
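One case that may be worth checking is a group made of a single line, such as `chr2 5078129 2` in the sample: its start and end are the same coordinate, so the distance is 0 even though no coordinate is repeated in the file. The sketch below only illustrates that arithmetic. The helper `summarize_group` is hypothetical, and it assumes the average is meant to be taken over the number of lines rather than over the distance.

```python
def summarize_group(coordinates, coverages):
    """Return (distance, average coverage) for one group of lines."""
    # Distance between the first and last coordinate; 0 for a single line.
    distance = coordinates[-1] - coordinates[0]
    # Assumption: divide by the number of lines, which is never 0 here.
    average = sum(coverages) / float(len(coverages))
    return distance, average


# Single-line group from the sample data: chr2 5078129 2
print(summarize_group([5078129], [2]))

# Four-line group from the sample data (chr14)
print(summarize_group([22086843, 22086846, 22086849, 22086851], [2, 2, 2, 4]))
```

For the single-line group the distance comes out as 0, so dividing the coverage sum by the distance would fail there.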