3

According to this FAQ on zlib.net it is possible to:

access data randomly in a compressed stream

I know about the module Bio.bgzf of Biopython 1.60, which:

supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. This uses Python’s zlib library internally, and provides a simple interface like Python’s gzip library.

But for my use case I don't want to use that format. Basically, I want something that emulates the code below:

import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz','rt') as f:
    f.seek(large_integer_new_line_start)

but with the efficient random access to the compressed stream that zlib (zlib.net) offers natively. How do I leverage that random access capability in Python?

Braiam
tommy.carstensen
  • Why don't you want to use a bgzip compressed file? Bgzip is valid gzip... – winni2k Sep 26 '17 at 13:13
  • @wkretzsch I do want to use bgzip. I asked the question more than 3 years ago, so I can't quite remember the details. Probably the files I was working with were gzipped and not bgzipped. – tommy.carstensen Sep 26 '17 at 23:40

4 Answers

7

I gave up on doing random access on a gzipped file using Python. Instead I converted my gzipped file to a block gzipped file with a block compression/decompression utility on the command line:

zcat large_file.gz | bgzip > large_file.bgz

Then I used Biopython and tell() to get the virtual_offset of line number 1 million in the bgzipped file. Afterwards I was able to seek rapidly to that virtual_offset:

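from Bio import bgzf

path = 'large_file.bgz'

# First pass: read sequentially up to the target line and record
# the BGZF virtual offset at which that line starts.
handle = bgzf.BgzfReader(path)
for i in range(10**6):
    handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()

# Later: jump straight to the saved virtual offset without
# decompressing everything that comes before it.
handle = bgzf.BgzfReader(path)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

assert line1 == line2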
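For anyone wondering what the number returned by tell() is: as far as I understand the BGZF format, a virtual offset packs two values into one 64-bit integer, the byte position of the block in the compressed file (upper 48 bits) and the position inside the decompressed block (lower 16 bits). The two functions below are a standalone sketch of that arithmetic written for illustration, not code taken from Biopython (Bio.bgzf has its own helpers for this).

```python
def make_virtual_offset(block_start, within_block):
    """Pack a compressed-file block start and an in-block offset into one int."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Inverse of make_virtual_offset: return (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# A block starting at byte 100000 of the .bgz file, 10 bytes into its data.
voffset = make_virtual_offset(100000, 10)
block_start, within_block = split_virtual_offset(voffset)
```

This is why a virtual offset is only meaningful for the exact file it was taken from: it encodes where a block sits in that particular compressed file.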

I would also like to point to the SO answer by Mark Adler here on examples/zran.c in the zlib distribution.
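To give a feel for the index idea using only the standard library, here is a toy in-memory sketch. It is not zran.c and not what any particular library does: instead of storing the sliding window at deflate block boundaries, it simply snapshots Python's zlib decompressor with copy() after every chunk of compressed input. The names build_checkpoints and read_at and the chunk size are made up for this example, and the snapshots live only in memory, so they cannot be saved to disk.

```python
import bisect
import gzip
import zlib


def build_checkpoints(compressed, chunk_size=4096):
    """One full pass over the gzip bytes, saving a decompressor snapshot per chunk.

    Each checkpoint is (uncompressed_offset, compressed_offset, decompressor_copy).
    """
    d = zlib.decompressobj(31)  # wbits=31: expect a gzip header and trailer
    checkpoints = [(0, 0, d.copy())]
    u_off = 0
    for c_off in range(0, len(compressed), chunk_size):
        u_off += len(d.decompress(compressed[c_off:c_off + chunk_size]))
        if d.eof:
            break
        checkpoints.append((u_off, c_off + chunk_size, d.copy()))
    return checkpoints


def read_at(compressed, checkpoints, offset, size, chunk_size=4096):
    """Return `size` uncompressed bytes starting at uncompressed `offset`."""
    keys = [u for u, _, _ in checkpoints]
    # Last checkpoint whose uncompressed offset is <= the requested offset.
    u_off, c_off, state = checkpoints[bisect.bisect_right(keys, offset) - 1]
    d = state.copy()  # keep the stored snapshot reusable
    skip = offset - u_off
    out = bytearray()
    while len(out) < skip + size and c_off < len(compressed) and not d.eof:
        out += d.decompress(compressed[c_off:c_off + chunk_size])
        c_off += chunk_size
    return bytes(out[skip:skip + size])


# Small demonstration on synthetic data held in memory.
data = b"".join(b"line %d\n" % i for i in range(200000))
compressed = gzip.compress(data)
checkpoints = build_checkpoints(compressed)
chunk = read_at(compressed, checkpoints, 1000000, 50)
```

The first pass still has to decompress everything once; after that, each read only decompresses from the nearest checkpoint. A real index spaces its access points much further apart to keep memory use down.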

Community
tommy.carstensen
  • Very useful, thanks for sharing the snippet! I will follow a similar path then. Considering that it's a 9-year-old post, please consider sharing if you have found a better option :D – Amin.A Oct 26 '22 at 08:57
0

You are looking for dictzip.py, part of the serpento package. However, you have to compress the files with dictzip, a randomly seekable, backward-compatible variant of gzip compression.

0

The indexed_gzip package might be what you want. It also uses zran.c under the hood.

mxmlnkn
-3

If you just want to access the file from a random point can't you just do:

from random import randint

with open(filename, 'rb') as f:
    f.seek(0, 2)              # jump to the end to measure the file size
    size = f.tell()
    f.seek(randint(0, size))  # seek from the start of the file to a random byte
Scorpion_God
  • @scorpion-god Thanks. I don't want to read from a random point. I want the possibility to access any point in the gzip compressed file without having to read through the entire deflated stream. – tommy.carstensen Apr 08 '14 at 23:31
  • @tommy.carstensen you can't teleport. – Scorpion_God Apr 08 '14 at 23:37
  • @tommy.carstensen More to the point: this is why they created BGZF because by default gzipped does not support this. You still have to read the entire thing once to build an index. More info: http://stackoverflow.com/questions/14225751/random-access-to-gzipped-files – metatoaster Apr 08 '14 at 23:43
  • @metatoaster Thanks. I don't mind reading the entire thing once to build an index. I will use BGZF, if I can figure out how to use it on any file format of my choice. It seems to be restricted to a pre-selected set of file formats. – tommy.carstensen Apr 09 '14 at 01:00