I have a bam file that looks like this:

samtools view pingpon.forward.bam | head
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU

I also have another file with the IDs I am interested in that looks like this:

K00311:84:HYCNTBBXX:1:2223:15798:5692
K00311:84:HYCNTBBXX:2:2211:11414:30696
K00311:84:HYCNTBBXX:2:2223:28879:41581

Ideally, I want to extract the lines from the BAM file that start with the IDs listed in the IDs file. At the moment I am using the code below, which I wrote, but it's not working. Any help would be appreciated! Thanks

import pysam
import re


forward = pysam.AlignmentFile('pingpon.forward.bam', "rb")
reverse = pysam.AlignmentFile('pingpon.reverse.bam', "rb")

ids = open("IDs_results_bed_reverse.txt", "w")

for line in reverse:
        if re.match("(.*)(I|i)ds(.*)", line):
            print(line)

2 Answers

0

The first with statement reads the ID file and creates a dictionary with one empty entry per ID. The second with statement reads the lines of the other file and appends each matching line to the corresponding entry in the dictionary.

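import re

# Illumina-style read name: seven colon-separated fields
regex = re.compile(r'[A-Z0-9]+:\d+:[A-Z0-9]+:\d+:\d+:\d+:\d+')

# Both inputs must be plain text (e.g. the output of samtools view);
# a binary BAM file cannot be read line by line like this.
with open('id.bam') as file:
    ids = {}
    for line in file:
        if regex.match(line):
            ids[line.strip()] = []

print(ids)

with open('list.bam') as file:
    for line in file:
        if regex.match(line):
            # split() handles both tabs and spaces; the first field is the read name
            name = line.split()[0]
            if name in ids:
                ids[name].append(line.rstrip('\n'))

print(ids)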
  • Thanks for the help! However, this code didn't really work. The output looks like this: 'K00311:84:HYCNTBBXX:2:1117:17919:8453': [], ....... – Amaranta_Remedios Mar 31 '20 at 20:57
  • Yeah, that's because no ID in the ID file is in the data file. If you add K00311:84:HYCNTBBXX:1:1123:2909:4215 to the IDs file, you can see the difference – Lucas de Paula Mar 31 '20 at 21:04
  • I don't understand, sorry, because that's exactly what I already have in the ID file. All the IDs. – Amaranta_Remedios Mar 31 '20 at 21:07
  • I think I didn't get what the IDs are – Lucas de Paula Mar 31 '20 at 21:34
  • The IDs are the lines I want to extract from the BAM file. I have a large BAM file and I want to extract only the lines that start with a specific name present in the IDs file. So I just want to fish out some lines from the BAM file – Amaranta_Remedios Apr 01 '20 at 20:54
0

Someone had a similar question, and it was answered here: https://www.biostars.org/p/165090/
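
For completeness, here is a rough sketch of how the filtering could be done directly with pysam, assuming the IDs file holds one read name per line. The function names (load_ids, select_reads, filter_bam) are made up for illustration and are not part of pysam. Two things are worth noting about the snippet in the question: opening the IDs file with "w" truncates it, and each record yielded by an AlignmentFile is an alignment object rather than a string, so re.match cannot be applied to it; the read name is available as query_name.

```python
def load_ids(path):
    """Read one read name per line into a set, skipping blank lines."""
    with open(path) as handle:  # default mode is "r"; "w" would empty the file
        return {line.strip() for line in handle if line.strip()}


def select_reads(reads, wanted):
    """Yield only the records whose query_name is in the wanted set."""
    for read in reads:
        if read.query_name in wanted:
            yield read


def filter_bam(bam_path, ids_path, out_path):
    """Copy the alignments named in ids_path from bam_path to out_path."""
    import pysam  # imported here so the helpers above work without pysam

    wanted = load_ids(ids_path)
    # template=bam reuses the input header for the output BAM
    with pysam.AlignmentFile(bam_path, "rb") as bam, \
         pysam.AlignmentFile(out_path, "wb", template=bam) as out:
        for read in select_reads(bam, wanted):
            out.write(read)
```

It could then be called as filter_bam("pingpon.reverse.bam", "IDs_results_bed_reverse.txt", "filtered.bam"), where filtered.bam is just a placeholder output name. A set is used for the IDs so that each lookup stays fast even with a large BAM file.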