
The crux here is that this is a huge file. My goal is to avoid reading the entire file into memory at once, and also to avoid parsing every line in a loop to reach the line I need, because that takes forever: the file is literally 15 million lines long.

What I'm currently doing is opening the file as...

self._FH = gzip.open(filename, "rb")

...moving the pointer directly to the location of the needed line (using many shenanigans, but it works) and reading in the individual line.

The lines look similar to the ones below (these examples come from the beginning of the file, for ease of illustration)...

b'BAM\x01\x17\x18\x00\x00@HD\tVN:1.0\tSO:coordinate\n'
b'@SQ\tSN:1\tLN:248956422\n'
b'@SQ\tSN:10\tLN:133797422\n'
b'@SQ\tSN:11\tLN:135086622\n'
b'@SQ\tSN:12\tLN:133275309\n'
b'@SQ\tSN:13\tLN:114364328\n'
b'@SQ\tSN:14\tLN:107043718\n'
b'@SQ\tSN:15\tLN:101991189\n'
b'@SQ\tSN:16\tLN:90338345\n' 
b'@SQ\tSN:17\tLN:83257441\n'
b'@SQ\tSN:18\tLN:80373285\n'

Some might notice this is a BAM file, so if there's a better way to do this, suggestions are welcome, although the samtools filters won't accomplish what I need. I have to seek by line, not by data.

RightmireM
  • Most compression algorithms are fundamentally incapable of decompressing from an arbitrary point; you have to decompress every single byte prior to that point whether you intend to use those bytes or not. Consider doing a one-time import of this file into something like a database that allows you to directly retrieve individual lines. – jasonharper Nov 10 '17 at 19:17
  • "*What I'm currently doing is ...*" Is what you are currently doing working well for you? In what way is your current solution problematic? – Robᵩ Nov 10 '17 at 19:21
  • Well, is something preventing you from using more than one file? – Ebbe M. Pedersen Nov 10 '17 at 21:11

2 Answers


A simple approach would be to take advantage of the fact that a concatenation of valid gzip streams is a gzip stream. Then when compressing you can compress chunks of lines into individual gzip streams and note the starting location of the gzip stream in the file, and the line number of the first line compressed in that stream. Then you can just jump to that location and start decompressing from there. If your chunks are on the order of a megabyte (around 50,000 lines), then there should be relatively little reduction in the compression ratio. Then on average you would need to decompress 25,000 lines to get to any given line, instead of 7.5 million lines.
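The chunked approach described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function names `write_chunked_gzip` and `read_line`, and the in-memory `(first_line_no, byte_offset)` index, are assumptions made for this example, and it only applies when you control how the file is written.

```python
import bisect
import gzip


def write_chunked_gzip(lines, path, chunk_lines=50_000):
    """Write `lines` (a list of bytes, each ending in a newline) as a series
    of concatenated gzip members. Returns a list of
    (first_line_no, byte_offset) pairs, one per member."""
    index = []
    with open(path, "wb") as raw:
        for start in range(0, len(lines), chunk_lines):
            # Remember where this member starts in the compressed file.
            index.append((start, raw.tell()))
            raw.write(gzip.compress(b"".join(lines[start:start + chunk_lines])))
    return index


def read_line(path, index, line_no):
    """Return line `line_no` (0-based), decompressing only from the start
    of the member that contains it."""
    starts = [first for first, _ in index]
    first, offset = index[bisect.bisect_right(starts, line_no) - 1]
    with open(path, "rb") as raw:
        raw.seek(offset)  # jump straight to the start of the gzip member
        with gzip.GzipFile(fileobj=raw, mode="rb") as gz:
            for _ in range(line_no - first):
                gz.readline()  # skip earlier lines in this chunk
            return gz.readline()
```

Because the output is a concatenation of valid gzip streams, the whole file can still be read with a plain `gzip.open(path)`; the index only adds a shortcut. In practice the index would be saved alongside the file rather than kept in memory.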

If you are not in control of the creation of the gzip file, and can't recreate it to your needs, then you can index an existing gzip file using the approach used in zran.c. You can specify how close you want your access points to be and it will build an index that allows access starting at each of those points. You would also need to build an index to your line starts (as you would for an uncompressed file), to associate those with byte offsets into the uncompressed data.
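The second half of that suggestion, the index from line numbers to offsets in the uncompressed data, might look something like the sketch below. The names `build_line_index` and `get_line` and the `every` parameter are hypothetical. Note the important caveat: Python's `gzip` module implements `seek()` by decompressing from the start of the file, so this alone does not give fast random access. It only shows the line-to-offset bookkeeping that would be paired with zran-style access points.

```python
import gzip


def build_line_index(path, every=1000):
    """One sequential pass over the file. Returns the uncompressed byte
    offset of lines 0, every, 2*every, ..."""
    checkpoints = []
    offset = 0
    with gzip.open(path, "rb") as f:
        for line_no, line in enumerate(f):
            if line_no % every == 0:
                checkpoints.append(offset)
            offset += len(line)  # offsets are in uncompressed bytes
    return checkpoints


def get_line(path, checkpoints, line_no, every=1000):
    """Return line `line_no` (0-based) using the checkpoint index."""
    with gzip.open(path, "rb") as f:
        # With the standard gzip module this seek still decompresses from
        # the beginning; a zran-style index would start at a nearby
        # access point instead.
        f.seek(checkpoints[line_no // every])
        for _ in range(line_no % every):
            f.readline()
        return f.readline()
```

The value of `every` trades index size against how many lines must be skipped after each seek.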

Mark Adler

You won't be able to pinpoint a specific line for random access in a gzip file, but you could build an index into the compressed file and then jump to a block of 1000 lines or so. indexed-gzip might be an option.

However, looking at the data makes me wonder whether you could just do the compression by hand. If you encode every line at a fixed length, you can calculate where each line starts in the file and then read directly from that position. It seems that each line can be represented by just two numbers. Or do I misunderstand the format?
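As a rough illustration of the fixed-length idea, the sketch below packs each record as two unsigned 32-bit integers, so record `n` always starts at byte `n * 8`. This rests on the assumption that every line really reduces to two integers that fit in 32 bits, which holds for the sample header lines shown in the question but may not hold for the rest of the file. The helper names `write_records` and `read_record` are made up for this example.

```python
import struct

# Two little-endian unsigned 32-bit integers per record (8 bytes).
RECORD = struct.Struct("<II")


def write_records(path, pairs):
    """Write each (a, b) pair as one fixed-length binary record."""
    with open(path, "wb") as f:
        for a, b in pairs:
            f.write(RECORD.pack(a, b))


def read_record(path, n):
    """Return record `n` (0-based) by seeking straight to its offset."""
    with open(path, "rb") as f:
        f.seek(n * RECORD.size)  # the offset is computed, no index needed
        return RECORD.unpack(f.read(RECORD.size))
```

For example, after `write_records("sq.bin", [(1, 248956422), (10, 133797422)])`, calling `read_record("sq.bin", 1)` reads only the eight bytes of the second record.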

Ebbe M. Pedersen
  • If it's compressed, how do you identify the beginning and end of a "line"? Not with a newline character. I guess the "index" can include a pair of integers for the start and end byte offsets of the compressed segment. – President James K. Polk Nov 10 '17 at 20:25
  • The index allows you to jump in at various places in the file and then stream from there, avoiding having to stream all the way from the start. It doesn't help you identify where a specific line starts, but then again, neither does an uncompressed file. – Ebbe M. Pedersen Nov 10 '17 at 21:01
  • Good suggestion on the fixed-length compression, unfortunately I don't have control of the file creation. I gets what they gives me :) – RightmireM Nov 11 '17 at 16:58
  • @James K Polk Actually, I have been using the newline as the indicator. I was surprised that it would work actually, but it seems to be consistent. – RightmireM Nov 11 '17 at 16:59