
I have a large bzip2-compressed file on S3 and I'm only interested in its first line. How can I read the first line(s) without downloading and decompressing the entire file?

Raffael

1 Answer

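import bz2
import io

import boto3

s3 = boto3.resource("s3")
s3_object = s3.Object("bucket-name", "path/file.bz2")

# Download only the first ~100 kB of the object via an HTTP Range request.
f_bz2 = s3_object.get(Range="bytes=0-100000")["Body"].read()
io_bz2 = io.BytesIO(f_bz2)

lines = []
with bz2.BZ2File(io_bz2, "r") as f:
    try:
        while True:
            line = f.readline()
            if not line:  # the whole file fit into the requested range
                break
            lines.append(line)
    except EOFError:
        # Expected: the truncated data ends before the end-of-stream marker.
        pass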

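bzip2 compresses data in blocks of 100 kB to 900 kB of uncompressed input, depending on the compression level. The code above fetches only roughly the first 100 kB of the compressed file, so it yields lines only if at least the first compressed block lies completely within that range. If no lines come back, request a larger range.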
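As an alternative to `BZ2File`, the truncated prefix can be fed to `bz2.BZ2Decompressor`, which returns whatever it could decode and does not complain about the missing end of the stream. This is only a minimal sketch: the helper name `first_lines_from_prefix` is made up for illustration, and it assumes `prefix` holds the first bytes of the object (for example the result of the range request).

```python
import bz2
from typing import List


def first_lines_from_prefix(prefix: bytes, max_lines: int = 1) -> List[bytes]:
    """Return up to max_lines lines decodable from the start of a bzip2 file.

    `prefix` may be truncated; only lines that were fully decoded are returned.
    """
    decompressor = bz2.BZ2Decompressor()
    # Returns the output of all blocks that are complete within `prefix`;
    # a trailing incomplete block simply produces no output.
    data = decompressor.decompress(prefix)

    pieces = data.split(b"\n")
    lines = [piece + b"\n" for piece in pieces[:-1]]

    # The last piece is an unterminated line. It is only trustworthy if the
    # whole stream was decoded, i.e. the prefix contained the entire file.
    if decompressor.eof and pieces[-1]:
        lines.append(pieces[-1])

    return lines[:max_lines]
```

An empty result means that the first block did not fit into the prefix, so a larger range is needed.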

Because the downloaded data is truncated, reading past the last complete block raises an exception, which is expected here:

EOFError: Compressed file ended before the end-of-stream marker was reached
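If the first block may be larger than the initial range, one option is to keep requesting further ranges and feed them to a single `bz2.BZ2Decompressor` until a newline shows up. The following is a rough sketch, not a tested S3 client: `read_first_line` and the `fetch_range` callable are hypothetical names, and `fetch_range(start, end)` is assumed to return the bytes from `start` to `end` inclusive (fewer or none at the end of the object).

```python
import bz2
from typing import Callable


def read_first_line(
    fetch_range: Callable[[int, int], bytes],
    chunk_size: int = 100_000,
) -> bytes:
    """Fetch compressed chunks until the first decompressed line is complete."""
    decompressor = bz2.BZ2Decompressor()
    buffer = b""
    offset = 0

    # Stop once a newline was decoded or the bzip2 stream has ended.
    while b"\n" not in buffer and not decompressor.eof:
        chunk = fetch_range(offset, offset + chunk_size - 1)
        if not chunk:
            break  # no more data available
        offset += len(chunk)
        buffer += decompressor.decompress(chunk)

    line, newline, _ = buffer.partition(b"\n")
    return line + newline
```

With boto3, `fetch_range` could be something like `lambda start, end: s3_object.get(Range=f"bytes={start}-{end}")["Body"].read()`. As far as I know, S3 rejects a range that starts beyond the end of the object, so a real wrapper would probably have to handle that error.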
Dunedan
Raffael