Merge fields in a file

Question

I have a GFF file with 7 columns describing chromosomal regions. I want to collapse the rows where REGION = "exon" into a single row whenever their regions overlap each other.

REGION  START   END  SCORE STRAND FRAME     ATTRIBUTE
 exon   26453   26644   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 exon   26842   27020   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 exon   30355   30899   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 GS_TRAN    30355   34083   .   -   .   GS_TRAN "Hs22_30444_28_1_1"; Name "Hs22_30444_28_1_1"
 snp    30847   30847   .   +   .   SNP "rs2971719"; Name "rs2971719"
 exon   31012   31409   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 exon   34013   34083   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 exon   40932   41071   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 snp    44269   44269   .   +   .   SNP "rs2873227"; Name "rs2873227"
 snp    45723   45723   .   +   .   SNP "rs2227095"; Name "rs2227095"
 exon   134031  134495  .   -   .   Transcript "XM_086913"; Name "XM_086913"            
 exon   134034  134457  .   -   .   Transcript "XM_086914"; Name "XM_086914"

In the sample data above, only the last two rows can be merged into one row, so the new row becomes:

exon    134031  134495  .   -   .   Transcript "XM_086913"; Name "XM_086913"

If the END of the second row had been greater than that of the first, it would have become the END of the merged row. Basically, if there is any overlap, take the START of the region that starts earlier and the END of the one that ends later.

There can be multiple instances of such rows; in this sample there are only the last two. Note that the ATTRIBUTE column will always show different Transcript names for such rows, whereas neighbouring rows mostly share the same name in other cases.

I have to do this in Python, and I am a beginner in Python.

[BioPython](http://biopython.org/wiki/GFF_Parsing) has tools for parsing GFF files. They may be a good place to start. — GWW, Mar 21 '14 at 17:32

score 1 · Answer 1 · answered Mar 21 '14 at 17:57

Break it down into simpler steps:

1. Read the file and parse it into a list of rows.
2. Loop over your list and check each row against the next.
3. Append the rows that fulfill your requirements to a new list.
4. Save your new list to a new file or print it to the console.
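The first step could be sketched roughly as follows. This assumes the columns are separated by whitespace exactly as in the question's sample (a standard GFF file is tab-separated with more columns), that only the ATTRIBUTE column contains spaces, and that the first line is the header. The name parse_rows is a hypothetical helper, not part of any library.

```python
def parse_rows(lines):
    """Parse whitespace-separated rows into lists of 7 fields.

    Assumed layout (from the question's sample):
    REGION START END SCORE STRAND FRAME ATTRIBUTE
    """
    rows = []
    for line in lines:
        # Split on runs of whitespace, at most 6 times, so the
        # ATTRIBUTE column keeps its internal spaces.
        fields = line.strip().split(None, 6)
        if len(fields) < 7 or fields[0] == "REGION":
            continue  # skip blank lines, short rows and the header
        fields[1] = int(fields[1])  # START
        fields[2] = int(fields[2])  # END
        rows.append(fields)
    return rows


sample = [
    'REGION  START   END  SCORE STRAND FRAME     ATTRIBUTE',
    ' exon   134031  134495  .   -   .   Transcript "XM_086913"; Name "XM_086913"',
    ' exon   134034  134457  .   -   .   Transcript "XM_086914"; Name "XM_086914"',
]
mylist = parse_rows(sample)
```

To read from a real file, pass the open file object instead of the sample list, since iterating over a file yields its lines.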

You might want to step through the list manually by index instead of using a "for row in mylist" loop, like this:

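newlist = []
i = 0
while i < len(mylist):
    # Only look ahead when a next row exists, to avoid an IndexError.
    if i + 1 < len(mylist) and can_collapse(mylist[i], mylist[i + 1]):
        newlist.append(collapse(mylist[i], mylist[i + 1]))
        i += 2  # skip both rows that were merged
    else:
        newlist.append(mylist[i])
        i += 1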
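The loop above leaves can_collapse and collapse undefined. A minimal sketch of both follows, assuming each row is a list whose first three items are the region name, an integer START and an integer END (this row layout is an assumption, not something the answer specifies), and that rows are sorted by START as in the question's sample. The extra function collapse_all is a hypothetical variant of the loop that compares each row with the last row already kept, so a run of three or more overlapping exons also ends up as one row.

```python
def can_collapse(a, b):
    """True if both rows are exons and their START..END ranges overlap."""
    return a[0] == "exon" and b[0] == "exon" and a[1] <= b[2] and b[1] <= a[2]


def collapse(a, b):
    """Merge two rows: earliest START, latest END, other fields from a."""
    return [a[0], min(a[1], b[1]), max(a[2], b[2])] + a[3:]


def collapse_all(rows):
    """Merge neighbouring overlapping exon rows, including longer runs."""
    merged = []
    for row in rows:
        if merged and can_collapse(merged[-1], row):
            merged[-1] = collapse(merged[-1], row)
        else:
            merged.append(row)
    return merged


rows = [
    ["exon", 40932, 41071, ".", "+", ".", 'Transcript "XM_092971"; Name "XM_092971"'],
    ["snp", 44269, 44269, ".", "+", ".", 'SNP "rs2873227"; Name "rs2873227"'],
    ["exon", 134031, 134495, ".", "-", ".", 'Transcript "XM_086913"; Name "XM_086913"'],
    ["exon", 134034, 134457, ".", "-", ".", 'Transcript "XM_086914"; Name "XM_086914"'],
]
result = collapse_all(rows)
# result has 3 rows; the last is exon 134031..134495 with the XM_086913 attribute
```

Like the loop above, this only compares neighbouring rows, so two overlapping exons separated by a snp or GS_TRAN row would not be merged. Handling that case would need the exon rows to be filtered out or tracked separately first.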
