
I'm trying to read the latest Wikidata dump while skipping the first, say, 100 lines.

Is there a better way to do this than calling next() repeatedly?

import bz2

WIKIDATA_JSON_DUMP = bz2.open('latest-all.json.bz2', 'rt')

# Skip the first 100 lines one call at a time
for n in range(100):
    next(WIKIDATA_JSON_DUMP)

Alternatively, is there a way to split up the file in bash by, say, using bzcat to pipe select chunks to smaller files?

zadrozny

2 Answers


If it was compressed with something like bgzip, you can skip blocks, but each block holds a variable number of lines, depending on the compression ratio. For a raw bzip2 file, which is a single stream, I don't think you have any choice but to read and discard the lines to be skipped, due to the nature of the compression format.
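A minimal sketch of that read-and-discard approach using `itertools.islice`, which avoids the explicit Python-level loop over `next()`. The synthetic file and line counts here are just for illustration; with the real dump you would open `latest-all.json.bz2` instead:

```python
import bz2
import collections
import itertools
import os
import tempfile

# Build a small synthetic .bz2 file so the sketch is self-contained;
# with the real dump you would open 'latest-all.json.bz2' instead.
path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with bz2.open(path, "wt") as f:
    for i in range(200):
        f.write(f"line {i}\n")

with bz2.open(path, "rt") as f:
    # Consume and discard the first 100 lines (the itertools
    # "consume" recipe: a zero-length deque drains the iterator
    # at C speed) instead of calling next() 100 times in a loop.
    collections.deque(itertools.islice(f, 100), maxlen=0)
    first_kept = next(f)
```

The decompression work is the same either way; only the skipping loop moves out of Python bytecode.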

Tom Morris
  • bzip2 has a similar block structure that allows skipping or parallel decompression of individual blocks, with the identical problem that blocks correspond to unknown amounts of uncompressed data. See the last paragraph in this section: https://en.wikipedia.org/wiki/Bzip2#File_format – Matthias Winkelmann Jul 30 '21 at 00:17

You can try the following in bash, to skip the first 10 lines for example:

bzcat /tmp/myfile.bz2 | tail -n +11

(bzcat already decompresses to stdout, so no extra flags are needed.) Note that tail -n +K starts output at line K, so to skip N lines you pass N+1.
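If you'd rather do the splitting from Python than from bash, a rough sketch along the same streaming lines; the function name and `chunk_` file prefix are just assumptions for illustration:

```python
import bz2
import itertools

def split_bz2_by_lines(path, lines_per_chunk, prefix="chunk_"):
    """Stream a .bz2 file and write consecutive plain-text chunks
    of at most lines_per_chunk lines each; return the filenames.
    Only one chunk's worth of lines is held in memory at a time."""
    names = []
    with bz2.open(path, "rt") as f:
        for i in itertools.count():
            # islice pulls at most lines_per_chunk lines per pass
            chunk = list(itertools.islice(f, lines_per_chunk))
            if not chunk:
                break
            name = f"{prefix}{i}.txt"
            with open(name, "w") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

This only makes one decompression pass over the archive, unlike repeated bzcat | tail invocations, which would re-decompress from the start each time.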

Pineapples