The crux here is that this is a huge file. My goal is to avoid reading the entire file into memory at once, and also to avoid parsing every line in a loop to get to the line I need, because that takes forever: the file is literally 15 million lines long.
What I'm currently doing is opening the file as...
self._FH = gzip.open(filename, "rb")
...moving the pointer directly to the location of the needed line (using many shenanigans, but it works) and reading in the individual line.
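For concreteness, here is a minimal sketch of just the read step. The names `read_line_at` and `offset` are made up for illustration and are not from my actual code; the sketch assumes the byte offset of the line within the uncompressed data is already known (that is the part the shenanigans take care of).

```python
import gzip

def read_line_at(filename, offset):
    """Return the one line that starts at `offset`.

    `offset` is an assumption of this sketch: a byte position in the
    *uncompressed* stream, computed elsewhere.
    """
    with gzip.open(filename, "rb") as fh:
        # seek() on a gzip file works in terms of decompressed bytes.
        fh.seek(offset)
        # readline() reads up to and including the next b"\n".
        return fh.readline()
```

Only the requested line is returned as a Python object, so nothing is split into lines or collected along the way. The gzip module still has to decompress the data in front of `offset` to get there, though, because a gzip stream cannot be entered at an arbitrary position.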
The lines look similar to the ones below (these examples come from the beginning of the file, for simplicity and for reference)...
b'BAM\x01\x17\x18\x00\x00@HD\tVN:1.0\tSO:coordinate\n'
b'@SQ\tSN:1\tLN:248956422\n'
b'@SQ\tSN:10\tLN:133797422\n'
b'@SQ\tSN:11\tLN:135086622\n'
b'@SQ\tSN:12\tLN:133275309\n'
b'@SQ\tSN:13\tLN:114364328\n'
b'@SQ\tSN:14\tLN:107043718\n'
b'@SQ\tSN:15\tLN:101991189\n'
b'@SQ\tSN:16\tLN:90338345\n'
b'@SQ\tSN:17\tLN:83257441\n'
b'@SQ\tSN:18\tLN:80373285\n'
Some might notice this is a BAM file, so if there's a better way to do this, suggestions are welcome, although the samtools filters won't accomplish what I need. I have to seek by line, not by data.