4

I have a .bz2 file. I want to list its first or last 10 lines without decompressing it, as it is too big. I tried head -10 and tail -10, but I only see gibberish. I also need to compare two compressed files to check whether they are similar or not. How can I achieve this without decompressing the files?

EDIT: Similar means identical (have the same content).

Wiliam A
  • What do you mean by similar? BZ2 is a block-based format, so it *is* possible to decompress only small chunks of a file without reading the whole. – ypnos Feb 08 '13 at 14:12
  • For comparing two compressed files, you may find something on this page, although that question specifically asks about `.zip` files: http://stackoverflow.com/questions/587442/is-there-a-safe-way-to-run-a-diff-on-two-zip-compressed-files – ajp15243 Feb 08 '13 at 14:14
  • The files must get decompressed. I think what you're actually asking is "without having to save a copy of the decompressed file." – Andy Lester Feb 08 '13 at 15:37

2 Answers

10

bzip2 is a block-based compression algorithm, so in theory you could locate and decompress only the particular blocks you want. In practice this would be complicated (e.g. what if the last ten lines you ultimately want to see span two or more compressed blocks?).

To answer your immediate question, you can do the following. It decompresses the data as a stream (the entire file in the tail case), which is in a sense wasteful, but it never stores the decompressed output anywhere, so you don't run into storage capacity issues:

bzcat file.bz2 | head -10
bzcat file.bz2 | tail -10

If your distribution doesn't include bzcat (which would be a bit unusual in my experience), bzcat is equivalent to bzip2 -d -c.
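
As a rough illustration of the two pipelines above, here is a minimal sketch that builds its own small sample file and reads both ends of it through `bzip2 -dc`. The scratch directory, the file name `sample.bz2`, and the `seq`-generated content are assumptions made purely for the demo; substitute your own file.

```shell
#!/bin/sh
# Sketch: peek at both ends of a .bz2 file without saving the decompressed data.
set -u
workdir=$(mktemp -d)    # scratch directory for the hypothetical sample file
sample="$workdir/sample.bz2"
seq 1 100000 | bzip2 -c > "$sample"    # sample content: the numbers 1..100000

# head exits after three lines, so the decompressor is cut short by the closed pipe.
first=$(bzip2 -dc "$sample" 2>/dev/null | head -n 3)

# tail has to read to the end of the stream, so the whole file is decompressed,
# but the output only ever passes through the pipe.
last=$(bzip2 -dc "$sample" | tail -n 3)

printf 'first lines:\n%s\n' "$first"
printf 'last lines:\n%s\n' "$last"
rm -rf "$workdir"
```

Nothing larger than the compressed sample is ever written to disk here; the decompressed bytes exist only in the pipe.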

However, if your ultimate goal is to compare two compressed files (that may have been compressed at different levels, and so comparing the actual compressed files directly doesn't work), you can do this (assuming bash as your shell):

cmp <(bzcat file1.bz2) <(bzcat file2.bz2)

This will decompress both files and compare the uncompressed data byte-by-byte without ever storing either of the decompressed files anywhere.
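
If your shell lacks process substitution, roughly the same comparison can be done with a named pipe. The sketch below is one possible way to do it, not the only one: the helper name `same_content`, the scratch directory, and the sample files `a.bz2`, `b.bz2`, and `c.bz2` are all made up for the demo. It also shows why comparing the compressed files directly is misleading.

```shell
#!/bin/sh
# Sketch: compare the decompressed contents of two .bz2 files via a named pipe.
set -u
workdir=$(mktemp -d)    # scratch directory for hypothetical sample files
seq 1 50000 | bzip2 -1 -c > "$workdir/a.bz2"    # same data, compression level 1
seq 1 50000 | bzip2 -9 -c > "$workdir/b.bz2"    # same data, compression level 9
seq 1 50001 | bzip2 -9 -c > "$workdir/c.bz2"    # one extra line

# same_content FILE1 FILE2: succeeds if both files decompress to identical bytes.
same_content() {
    fifo="$workdir/fifo"
    mkfifo "$fifo"
    bzip2 -dc "$2" > "$fifo" 2>/dev/null &          # second stream goes through the FIFO
    bzip2 -dc "$1" 2>/dev/null | cmp -s - "$fifo"   # first stream arrives on stdin
    status=$?
    wait    # reap the background decompressor
    rm -f "$fifo"
    return "$status"
}

# The compressed bytes differ even though a and b hold the same data.
if cmp -s "$workdir/a.bz2" "$workdir/b.bz2"; then raw_ab=identical; else raw_ab=different; fi
if same_content "$workdir/a.bz2" "$workdir/b.bz2"; then content_ab=identical; else content_ab=different; fi
if same_content "$workdir/a.bz2" "$workdir/c.bz2"; then content_ac=identical; else content_ac=different; fi

echo "a vs b, compressed bytes: $raw_ab"
echo "a vs b, decompressed:     $content_ab"
echo "a vs c, decompressed:     $content_ac"
rm -rf "$workdir"
```

Because `cmp` stops at the first differing byte, two files that diverge early are rejected without decompressing either one completely.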

twalberg
  • `bzcat | head` will not decompress the entire file. When `head` terminates, it closes the pipe and `bzcat` gets a `SIGPIPE`. `bzcat | tail` will decompress the entire file, though. – Daniel Darabos Aug 13 '19 at 16:00
0

The plain standard bunzip2 command can't do this. However, the man page says that bzip2 by default works in blocks of 900 KB, and mentions bzip2recover, a program that extracts the individual blocks of a .bz2 file into separate small .bz2 files, each of which can then be decompressed on its own.

Using that knowledge, you should be able to put together something that cuts off the first and last 900 KB (or so) from the desired file, and then uses bzip2recover to extract the blocks found in those pieces so they can be decompressed.
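
A simpler variant of this idea, sketched below under several assumptions, runs bzip2recover on the whole file instead of on cut-off pieces and then decompresses only the first and last block files. The sample file and scratch directory are made up for the demo. Note that bzip2recover still scans the entire compressed file and writes every block back to disk (roughly the compressed size again), and that block boundaries do not line up with line boundaries, so the first line of a later block may be partial.

```shell
#!/bin/sh
# Sketch: split a .bz2 file into per-block files with bzip2recover, then read only the ends.
set -u
workdir=$(mktemp -d)    # scratch directory; the block files are written here
cd "$workdir" || exit 1
seq 1 200000 | bzip2 -1 -c > sample.bz2    # level 1 (100k blocks) so the sample has several blocks

# bzip2recover scans the whole compressed file and writes one small .bz2 per block
# next to the input (names of the form rec<number>sample.bz2).
bzip2recover sample.bz2 >/dev/null 2>&1

set -- rec*sample.bz2    # zero-padded names, so the glob lists the blocks in order
nblocks=$#
first_block=$1
for last_block in "$@"; do :; done    # ends up holding the last block's file name

first=$(bzip2 -dc "$first_block" 2>/dev/null | head -n 3)
last=$(bzip2 -dc "$last_block" | tail -n 3)

echo "blocks found: $nblocks"
printf 'first lines:\n%s\n' "$first"
printf 'last lines:\n%s\n' "$last"
cd / && rm -rf "$workdir"
```

Only the last block is decompressed to obtain the final lines, which is the saving this approach is after; whether it beats a plain `bzcat file.bz2 | tail` depends on how expensive the extra disk writes are for your file.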

unwind
  • The problem with this is, depending on the arguments given to originally compress the file, the block size is *up to* 900KB of the original uncompressed data. How that corresponds to locations within the compressed file is highly data-dependent and difficult to predict. – twalberg Feb 08 '13 at 15:34