
I'm trying to read the latest Wikidata dump while skipping the first, say, 100 lines.

Is there a better way to do this than calling next() repeatedly?

import bz2

WIKIDATA_JSON_DUMP = bz2.open('latest-all.json.bz2', 'rt')

# Skip the first 100 lines one call at a time
for n in range(100):
    next(WIKIDATA_JSON_DUMP)

Alternatively, is there a way to split up the file in bash by, say, using bzcat to pipe select chunks to smaller files?

zadrozny

2 Answers


If it was compressed with something like bgzip, you can skip blocks, but each block holds a variable number of lines, depending on the compression ratio. For a raw bzip2 file, which is a single stream, I don't think you have any choice but to read and discard the lines to be skipped, due to the nature of the compression format.
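A minimal sketch of that read-and-discard approach using `itertools.islice`, which avoids the explicit Python-level loop over `next()`. The synthetic file and line counts here are just for illustration; with the real dump you would open `latest-all.json.bz2` instead:

```python
import bz2
import collections
import itertools
import os
import tempfile

# Build a small synthetic .bz2 file so the sketch is self-contained;
# with the real dump you would open 'latest-all.json.bz2' instead.
path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with bz2.open(path, "wt") as f:
    for i in range(200):
        f.write(f"line {i}\n")

with bz2.open(path, "rt") as f:
    # Consume and discard the first 100 lines (the itertools
    # "consume" recipe: a zero-length deque drains the iterator
    # at C speed) instead of calling next() 100 times in a loop.
    collections.deque(itertools.islice(f, 100), maxlen=0)
    first_kept = next(f)
```

The decompression work is the same either way; only the skipping loop moves out of Python bytecode.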

Tom Morris
  • bzip2 has a similar block structure that allows skipping or parallel decompression of individual blocks, with the identical problem that blocks correspond to unknown amounts of uncompressed data. See the last paragraph in this section: https://en.wikipedia.org/wiki/Bzip2#File_format – Matthias Winkelmann Jul 30 '21 at 00:17

You can try the following in bash, to skip the first 10 lines for example:

bzcat /tmp/myfile.bz2 | tail -n +11

(bzcat already decompresses to stdout, so no extra flags are needed.) Note that tail -n +K starts output at line K, so to skip N lines you pass N+1.
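If you'd rather do the splitting from Python than from bash, a rough sketch along the same streaming lines; the function name and `chunk_` file prefix are just assumptions for illustration:

```python
import bz2
import itertools

def split_bz2_by_lines(path, lines_per_chunk, prefix="chunk_"):
    """Stream a .bz2 file and write consecutive plain-text chunks
    of at most lines_per_chunk lines each; return the filenames.
    Only one chunk's worth of lines is held in memory at a time."""
    names = []
    with bz2.open(path, "rt") as f:
        for i in itertools.count():
            # islice pulls at most lines_per_chunk lines per pass
            chunk = list(itertools.islice(f, lines_per_chunk))
            if not chunk:
                break
            name = f"{prefix}{i}.txt"
            with open(name, "w") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

This only makes one decompression pass over the archive, unlike repeated bzcat | tail invocations, which would re-decompress from the start each time.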

Pineapples