How to obtain random access of a gzip compressed file

Question

According to this FAQ on zlib.net it is possible to:

access data randomly in a compressed stream

I know about the module Bio.bgzf of Biopython 1.60, which:

supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. This uses Python’s zlib library internally, and provides a simple interface like Python’s gzip library.

But for my use case I don't want to use that format. Basically, I want something that emulates the code below:

import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz','rt') as f:
    f.seek(large_integer_new_line_start)

but with the efficiency that the native zlib (from zlib.net) offers for random access to the compressed stream. How do I leverage that random access capability in Python?

Why don't you want to use a bgzip compressed file? Bgzip is valid gzip... — winni2k, Sep 26 '17 at 13:13
@wkretzsch I do want to use bgzip. I asked the question more than 3 years ago, so I can't quite remember the details. Probably the files I was working with were gzipped and not bgzipped. — tommy.carstensen, Sep 26 '17 at 23:40

score 7 · Answer 1 · edited May 23 '17 at 12:19

I gave up on doing random access on a gzipped file using Python. Instead I converted my gzipped file to a block gzipped file with a block compression/decompression utility on the command line:

zcat large_file.gz | bgzip > large_file.bgz

Then I used Biopython's bgzf module and its tell() method to get the virtual_offset of line number 1 million in the bgzipped file. Afterwards I was able to seek rapidly to that virtual_offset:

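from Bio import bgzf

file = 'large_file.bgz'

# First pass: read up to the target line and record its virtual offset.
handle = bgzf.BgzfReader(file)
for i in range(10**6):
    handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()

# Later: reopen the file and jump straight to the recorded virtual offset.
handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

# Both reads must return the same line.
assert line1 == line2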
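
For readers wondering what the virtual_offset returned by tell() encodes: in BGZF it is a single integer that combines the compressed start position of a block with a position inside that block's decompressed data (blocks hold at most 64 KiB of uncompressed data, so 16 bits suffice for the second part). Below is a minimal stdlib-only sketch of that arithmetic. The helper names pack_voffset and unpack_voffset are hypothetical and are not part of Bio.bgzf.

```python
def pack_voffset(block_start, within_block):
    """Combine a compressed block start and an in-block position.

    Hypothetical helper illustrating the BGZF virtual offset layout:
    the upper 48 bits hold the block start, the lower 16 bits hold the
    position inside the decompressed block.
    """
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within_block must fit in 16 bits")
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block_start must fit in 48 bits")
    return (block_start << 16) | within_block


def unpack_voffset(voffset):
    """Split a virtual offset back into (block_start, within_block)."""
    return voffset >> 16, voffset & 0xFFFF
```

This is why a virtual offset cannot be treated as a plain byte position in the decompressed data: it only makes sense when passed back to seek() on the same file.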

I would also like to point to the SO answer by Mark Adler on examples/zran.c in the zlib distribution.
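
The general idea of an index over an ordinary gzip file (read the stream once, remember access points, later resume from the nearest one) can be sketched with Python's zlib module alone. Note that this is not how zran.c stores its access points: as a swapped-in technique, the sketch below snapshots the decompressor with Decompress.copy(), which keeps every snapshot in memory and cannot be saved to disk. The names build_index, read_at, demo, CHUNK and span are hypothetical, and the sketch assumes a single-member gzip file.

```python
import gzip
import io
import zlib
from bisect import bisect_right

CHUNK = 64 * 1024  # compressed bytes fed to the decompressor per step


def build_index(fileobj, span=1 << 20):
    """Read the whole gzip stream once and collect checkpoints.

    Each checkpoint is (uncompressed_offset, compressed_offset, snapshot),
    taken roughly every `span` uncompressed bytes.
    """
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # | 16: expect a gzip header
    index = [(0, 0, d.copy())]
    u_off = c_off = 0
    fileobj.seek(0)
    while not d.eof:
        chunk = fileobj.read(CHUNK)
        if not chunk:
            break
        # Without max_length, all of `chunk` is consumed by this call.
        u_off += len(d.decompress(chunk))
        c_off += len(chunk)
        if not d.eof and u_off - index[-1][0] >= span:
            index.append((u_off, c_off, d.copy()))
    return index


def read_at(fileobj, index, offset, size):
    """Return `size` uncompressed bytes starting at uncompressed `offset`."""
    i = bisect_right([u for u, _, _ in index], offset) - 1
    u_off, c_off, snapshot = index[i]
    d = snapshot.copy()  # work on a copy so the checkpoint stays reusable
    fileobj.seek(c_off)
    skip = offset - u_off  # bytes to discard between checkpoint and target
    out = bytearray()
    while len(out) < skip + size and not d.eof:
        chunk = fileobj.read(CHUNK)
        if not chunk:
            break
        out += d.decompress(chunk)
    return bytes(out[skip:skip + size])


def demo():
    """Round-trip check on in-memory data (no real file needed)."""
    raw = b"".join(b"line %d\n" % i for i in range(200000))
    buf = io.BytesIO(gzip.compress(raw))
    index = build_index(buf, span=100000)
    return read_at(buf, index, 1000000, 20) == raw[1000000:1000020]
```

The first pass still decompresses everything, but each later read_at() call only decompresses from the nearest checkpoint, so the work per lookup is bounded by roughly span bytes instead of the whole file.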

Very useful, thanks for sharing the snippet! I will follow a similar path then. Considering that it's a 9-year-old post, please consider sharing if you have found a better option :D — Amin.A, Oct 26 '22 at 08:57

score 0 · Answer 2 · answered Jun 29 '16 at 07:02

You are looking for dictzip.py, part of the serpento package. However, you have to compress the files with dictzip, which is a randomly seekable, backward-compatible variant of gzip compression.

Radovan Garabík


score 0 · Answer 3 · answered Mar 09 '19 at 15:24

The indexed_gzip package might be what you are looking for. It also uses zran.c under the hood.

mxmlnkn


Scorpion_God · Answer 4 · 2014-04-08T23:37:38.597

-3

If you just want to access the file from a random point, can't you just do this:

from random import randint

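# filename is the path to the file, defined elsewhere
with open(filename, 'rb') as f:
    f.seek(0, 2)              # jump to the end to measure the file size
    size = f.tell()
    f.seek(randint(0, size))  # seek to a random absolute byte offset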

edited Apr 08 '14 at 23:37

answered Apr 08 '14 at 23:27


@scorpion-god Thanks. I don't want to read from a random point. I want the possibility to access any point in the gzip compressed file without having to read through the entire deflated stream. – tommy.carstensen Apr 08 '14 at 23:31
@tommy.carstensen you can't teleport. – Scorpion_God Apr 08 '14 at 23:37
@tommy.carstensen More to the point: this is why they created BGZF, because by default gzip does not support this. You still have to read the entire thing once to build an index. More info: http://stackoverflow.com/questions/14225751/random-access-to-gzipped-files – metatoaster Apr 08 '14 at 23:43
@metatoaster Thanks. I don't mind reading the entire thing once to build an index. I will use BGZF, if I can figure out how to use it on any file format of my choice. It seems to be restricted to a pre-selected set of file formats. – tommy.carstensen Apr 09 '14 at 01:00
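
On the worry about file formats: the blocked approach itself does not depend on what the data contains. As a rough illustration (not the real BGZF format, which also records block sizes in a gzip extra field and uses virtual offsets), the sketch below compresses arbitrary bytes as independent gzip members and keeps their start offsets. The concatenated result is still valid gzip, and any single block can be decompressed on its own. The names write_blocked, read_block and BLOCK are hypothetical.

```python
import gzip
import io

BLOCK = 64 * 1024  # uncompressed bytes per independently compressed block


def write_blocked(raw, out, block=BLOCK):
    """Write `raw` to `out` as concatenated gzip members.

    Returns the compressed start offset of every member, which serves
    as a simple index.
    """
    starts = []
    for i in range(0, len(raw), block):
        starts.append(out.tell())
        out.write(gzip.compress(raw[i:i + block]))
    return starts


def read_block(fileobj, starts, n):
    """Decompress only block number `n`, using the recorded offsets."""
    fileobj.seek(starts[n])
    if n + 1 < len(starts):
        data = fileobj.read(starts[n + 1] - starts[n])
    else:
        data = fileobj.read()  # last block: read to the end of the file
    return gzip.decompress(data)
```

A byte at uncompressed position p then lives in block p // BLOCK at position p % BLOCK, so a lookup touches one block regardless of the file size. The trade-off is a somewhat lower compression ratio, because each block is compressed without knowledge of the others.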
