I have a GFF file with 7 columns describing chromosomal regions. I want to collapse the rows where REGION = "exon" whenever their regions overlap with each other, so that each set of overlapping exon rows becomes a single row in the file.
REGION START END SCORE STRAND FRAME ATTRIBUTE
exon 26453 26644 . + . Transcript "XM_092971"; Name "XM_092971"
exon 26842 27020 . + . Transcript "XM_092971"; Name "XM_092971"
exon 30355 30899 . - . Transcript "XM_104663"; Name "XM_104663"
GS_TRAN 30355 34083 . - . GS_TRAN "Hs22_30444_28_1_1"; Name "Hs22_30444_28_1_1"
snp 30847 30847 . + . SNP "rs2971719"; Name "rs2971719"
exon 31012 31409 . - . Transcript "XM_104663"; Name "XM_104663"
exon 34013 34083 . - . Transcript "XM_104663"; Name "XM_104663"
exon 40932 41071 . + . Transcript "XM_092971"; Name "XM_092971"
snp 44269 44269 . + . SNP "rs2873227"; Name "rs2873227"
snp 45723 45723 . + . SNP "rs2227095"; Name "rs2227095"
exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913"
exon 134034 134457 . - . Transcript "XM_086914"; Name "XM_086914"
Looking at the sample data above, only the last two rows can be merged into one row. So the new row will become:
exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913"
If the END of the second row had been greater than the END of the first, that larger value would be the END of the merged row. Basically, if there is any overlap, take the START of the region that starts earlier and the END of the region that ends later.
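The rule above can be sketched on bare (START, END) pairs. This is only an illustration: the function name `merge_overlapping` is made up here, and it assumes coordinates are inclusive, so two regions overlap when the later START is less than or equal to the earlier END.

```python
def merge_overlapping(intervals):
    """Merge overlapping (start, end) pairs.

    Assumption: coordinates are inclusive, so a region starting exactly
    at the previous region's END counts as overlapping.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlap: keep the earlier start, extend to the later end.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Exon coordinates taken from the sample data above.
exons = [(26453, 26644), (26842, 27020), (134031, 134495), (134034, 134457)]
print(merge_overlapping(exons))
# [(26453, 26644), (26842, 27020), (134031, 134495)]
```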
There can be multiple instances of such overlapping rows; here only the last two rows overlap. One thing to note is that the ATTRIBUTE column will definitely show different Transcript names for such rows, although the names are mostly the same across rows in other cases.
I have to do this in Python, and I am a beginner in Python.
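For reference, a rough sketch of what the whole-row version might look like in Python. Several things here are assumptions rather than facts about the file: rows are sorted by START (as in the sample), fields are separated by whitespace with ATTRIBUTE as the 7th field, STRAND is ignored when checking overlap, and the merged row keeps the ATTRIBUTE of the first row in the group (matching the expected output above). The function name `collapse_exons` and the file name in the comment are made up for illustration.

```python
def collapse_exons(lines, sep="\t"):
    """Collapse overlapping exon rows; all other rows pass through unchanged.

    Assumptions: rows are sorted by START, and the merged row keeps the
    SCORE, STRAND, FRAME and ATTRIBUTE of the first exon in each group.
    """
    rows = []         # every kept row, as a list of fields
    last_exon = None  # the most recently kept exon row (same object as in rows)
    for line in lines:
        # Split into at most 7 fields so the ATTRIBUTE text stays in one piece.
        fields = line.split(None, 6)
        if not fields:
            continue  # skip blank lines
        if fields[0] == "exon":
            start, end = int(fields[1]), int(fields[2])
            if last_exon is not None and start <= int(last_exon[2]):
                # Overlap: the earlier START is already stored, so only
                # the END may need extending.
                last_exon[2] = str(max(int(last_exon[2]), end))
                continue
            last_exon = fields
        rows.append(fields)
    return [sep.join(row) for row in rows]


sample = [
    'exon 40932 41071 . + . Transcript "XM_092971"; Name "XM_092971"',
    'snp 44269 44269 . + . SNP "rs2873227"; Name "rs2873227"',
    'exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913"',
    'exon 134034 134457 . - . Transcript "XM_086914"; Name "XM_086914"',
]
for row in collapse_exons(sample, sep=" "):
    print(row)

# With a real file (hypothetical name), the lines could be read like this:
#     with open("regions.gff") as fh:
#         result = collapse_exons(fh)
```

Because only the most recent exon row is remembered, this relies on the file being sorted by START; an unsorted file would need sorting first.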