I have a large bzip2 compressed file on S3 and I'm only interested in its first line. How can I read the first line(s) without downloading and decompressing the entire file?
1 Answer
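import boto3
import io
import bz2

s3 = boto3.resource('s3')
s3_object = s3.Object("bucket-name", "path/file.bz2")
# Download only the first 100,000 bytes (the HTTP Range header is inclusive).
f_bz2 = s3_object.get(Range="bytes=0-99999")["Body"].read()
io_bz2 = io.BytesIO(f_bz2)
lines = []
with bz2.BZ2File(io_bz2, "r") as f:
    while True:
        # readline() raises EOFError once the truncated data runs out.
        line = f.readline()
        if not line:
            break  # the whole file fit in the range: normal end of file
        lines.append(line)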
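The bzip2 block size ranges from 100 kB to 900 kB depending on the compression level, and it refers to the uncompressed data in each block. The code above fetches roughly the first 100 kB of the compressed file, which is enough as long as the first compressed block fits inside that range. If no complete line comes back, request a larger range.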
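In the end, the loop stops with an exception because the downloaded data is cut off before the end of the bzip2 stream. The lines read up to that point are still in `lines`: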
EOFError: Compressed file ended before the end-of-stream marker was reached
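If the exception is a nuisance, one possible alternative is `bz2.BZ2Decompressor`, which accepts partial input and simply returns whatever it could decompress so far. The sketch below is only an illustration under a few assumptions: `read_first_line`, `fetch_range`, `chunk_size` and `max_bytes` are hypothetical names, and `fetch_range(start, end)` stands for any callable that returns the bytes in that inclusive range. For S3 it could wrap `s3_object.get(Range=f"bytes={start}-{end}")["Body"].read()`. It assumes a single-stream bzip2 file and leaves out error handling for ranges past the end of the object.

```python
import bz2
from typing import Callable, Optional


def read_first_line(
    fetch_range: Callable[[int, int], bytes],  # hypothetical: returns bytes start..end inclusive
    chunk_size: int = 100_000,
    max_bytes: int = 2_000_000,
) -> Optional[bytes]:
    """Return the first line without its newline, or None if it was not found."""
    decompressor = bz2.BZ2Decompressor()
    buffered = b""
    offset = 0
    while offset < max_bytes:
        chunk = fetch_range(offset, offset + chunk_size - 1)
        if not chunk:
            break  # nothing more to download
        offset += len(chunk)
        # Partial input is fine here: decompress() returns what it can decode
        # and keeps the rest of the state for the next call.
        buffered += decompressor.decompress(chunk)
        newline = buffered.find(b"\n")
        if newline != -1:
            return buffered[:newline]
        if decompressor.eof:
            break  # end of stream reached without finding a newline
    # A complete file that has no newline at all counts as a single line.
    return buffered if decompressor.eof and buffered else None
```

Downloading in chunks means the size of the first compressed block does not have to be guessed up front: the function keeps requesting data until a newline shows up or `max_bytes` is reached.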