
I have a large gz file (11 GB compressed) that I can't fully decompress on my computer, even with 100 GB free. I've extracted the first 50 GB with the command:

gzip -cd file.gz | dd ibs=1024 count=50000000 > first_50_GB_file.txt

I was able to parse my data from this portion of the file. Now I want to extract the rest of the file and parse that too. I tried to take the last n lines of the compressed file and then decompress them, as follows:

tail -50 file.gz > last_part_of_file.gz

I hoped that afterwards, I could use:

gzip -cd last_part_of_file.gz | dd ibs=1024 count=50000000 > last_50_GB_file.txt

but the tail command is taking >10 minutes for a test of only 50 lines.

If anyone has any solutions on how to extract (potentially arbitrary) portions of a .gz file that do not include the beginning I would be very grateful.

1 Answer


tail can't work on a binary file: tail -50 returns the last 50 lines, which it finds by looking for the '\n' (char 10) delimiter.
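A further reason the tail approach cannot work, beyond the newline search: a fragment cut from the end of a gzip file lacks the gzip header and the earlier compressed data that the decompressor depends on. The following is a minimal sketch of that behaviour on a small throwaway archive; the names sample.gz and tail_part.gz are made up for illustration, and seq is assumed to be available.

```shell
# Build a small throwaway archive in a temporary directory.
workdir=$(mktemp -d)
seq 1 100000 | gzip -c > "$workdir/sample.gz"

# Keep only the last 1000 bytes of the compressed stream.
tail -c 1000 "$workdir/sample.gz" > "$workdir/tail_part.gz"

# gzip -t tests integrity; a fragment without the header should fail the test.
if gzip -t "$workdir/tail_part.gz" 2>/dev/null; then
    tail_status="valid"
else
    tail_status="invalid"
fi
echo "tail fragment is $tail_status gzip data"

rm -rf "$workdir"
```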

gzip -cd file.gz | dd ibs=1024 count=50000000 > first_50_GB_file.txt

gzip -cd file.gz | dd ibs=1024 skip=50000000 > after_50_GB_file.txt

At first I thought the extracted file was 100 GB in total. To limit the output to 50 GB:

gzip -cd file.gz | dd ibs=1024 skip=50000000 count=50000000 > next_50-100_GB_file.txt

and for the next 50 GB:

gzip -cd file.gz | dd ibs=1024 skip=100000000 count=50000000 > next_100-150_GB_file.txt

Each time, though, the gzip process must inflate from the beginning of the archive, because of how the compression algorithm works.
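If re-inflating from the start for every range is too slow, one alternative is to decompress once and hand each chunk to a command as it is produced, so that no chunk has to be stored on disk. This is only a sketch: process_chunks is a hypothetical helper name, and it assumes GNU split, whose --filter option runs a command on each chunk's data via standard input.

```shell
# process_chunks is a hypothetical helper; it assumes GNU split (for --filter).
# Usage: process_chunks FILE.gz CHUNK_BYTES COMMAND
# COMMAND is run once per chunk and reads that chunk from stdin.
process_chunks() {
    gzip -cd "$1" | split -b "$2" --filter="$3" -
}
```

For example, process_chunks file.gz 50G 'wc -l' would print a line count for each 50 GiB chunk in a single decompression pass. Fixed-size chunks can cut a line in half at a boundary; GNU split's -C (--line-bytes) option may be a better fit when the parser needs whole lines.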

Nahuel Fouilleul
  • Thanks, now I understand why tail wasn't working. I tried this and didn't have much success. Using 'gzip -cd file.gz | dd ibs=1024 skip=50000000 > after_50_GB_file.txt' Took up all of the space on my disk. So I assumed that I would have to tell the command to stop after a certain number of blocks. I then tried: 'gzip -cd file.gz | dd ibs=1024 skip=49000000 count=50000000 > after_49_GB_next_50GB.txt' and this produced a file of 90 GB. Do you know what might be going on? – Will Gibson May 10 '17 at 12:33
  • What did you get? – Nahuel Fouilleul May 10 '17 at 12:37
  • I was able to get it to work with: gzip -cd file.gz | dd ibs=1024 skip=49000000 count=50000000 of=after_49GB_next_50_GB_file.txt Thank you for your help! – Will Gibson May 10 '17 at 14:19
  • In fact, since the first 50 GB have already been extracted, the exact dd parameters are skip=50000000 count=50000000 – Nahuel Fouilleul May 10 '17 at 14:23
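The skip/count arithmetic from this thread can also be expressed in bytes. With ibs=1024, skip=50000000 and count=50000000 each cover 51,200,000,000 bytes. One possible pitfall, which is an assumption here and not something confirmed in the thread, is that dd reading from a pipe can receive short reads and count each as a block unless GNU dd's iflag=fullblock is given. The sketch below sidesteps that by using byte-oriented tail -c and head -c; gz_slice is a hypothetical helper name, not part of gzip or coreutils.

```shell
# gz_slice is a hypothetical helper name, not part of gzip or coreutils.
# Usage: gz_slice FILE.gz SKIP_BYTES COUNT_BYTES
# Writes COUNT_BYTES of decompressed data, starting after SKIP_BYTES, to stdout.
gz_slice() {
    archive=$1
    skip_bytes=$2
    count_bytes=$3
    # tail -c +N starts output at byte N (1-based), hence the +1.
    gzip -cd "$archive" | tail -c "+$((skip_bytes + 1))" | head -c "$count_bytes"
}
```

Under these assumptions, gz_slice file.gz 51200000000 51200000000 > next_50-100_GB_file.txt would correspond to skip=50000000 count=50000000. gzip still has to inflate from the start of the archive on every call.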