I'm trying to combine multiple (29) compressed files (.gz), one after the other, into one file.

The compressed files are around 500MB and in their uncompressed format ~30GB. All the files start with a header that I don't want in the final file.

I have tried to do it using zcat and gzip, but it takes a lot of time (more than 3 hours):

 zcat file*.gz | tail -n +2 | gzip -c >> all_files.txt.gz 

I have also tried it with pigz:

 unpigz -c file*.gz | tail -n +2 | pigz -c >> all_files_pigz.txt.gz 

However, I'm working on a cluster where pigz isn't available, and I can't install anything.

The last thing I have tried is to merge all with cat:

 cat file*.gz > all_files_cat.txt.gz

It doesn't take a lot of time, but when I read the resulting file, at some point the following message appears:

 gzip: unexpected end of file

How could I deal with this?

Marta_ma
  • Sadly, unless the files are compressed with BGZF or another indexable gzip format, then the only way to strip away that content is to unpack and repack the whole file. As you say, pigz or libdeflate will be faster than gzip. You could also look at doing the unpacking in parallel. In the end if you do this a lot it may be worth investigating bgzf. – Gem Taylor Sep 05 '19 at 13:46
  • @GemTaylor Thanks Gem! I'll take a look – Marta_ma Sep 05 '19 at 15:11

1 Answer

If you want to remove the first line of every uncompressed file, and concatenate them all into one compressed file, you'll need a loop. Something like

for f in file*.gz; do
  zcat "$f" | tail -n +2
done | gzip -c > all_files_cat.txt.gz

If there are lots of big files, yes, it can take a while. You could use a lower compression level than the default (at the expense of a larger output file), or a different compression program than gzip; there are lots of options, each with its own speed and compression-ratio tradeoffs.
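
As a rough sketch of the lower-compression-level idea, the same loop can be wrapped in a small function that recompresses with `gzip -1` (the fastest level) instead of the default. The function name `merge_gz_skip_header` is made up for illustration, and `gzip -dc` is used in place of `zcat` only because it behaves the same way and exists wherever gzip does. How much time this saves depends on the data, so treat it as something to measure rather than a guaranteed fix.

```shell
# Hypothetical helper (the name is an assumption, not a standard tool):
# strips the first line of every input and writes one gzip stream.
# Usage: merge_gz_skip_header OUTPUT.gz INPUT.gz...
merge_gz_skip_header() {
  out=$1
  shift
  for f in "$@"; do
    # Decompress one file to stdout and drop its header line
    gzip -dc "$f" | tail -n +2
  done | gzip -1 -c > "$out"   # -1 trades a larger output file for speed
}
```

It could be called as `merge_gz_skip_header all_files.txt.gz file*.gz`; as with the loop above, the output name must not match the input pattern.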

Shawn
  • But with the wildcard, `*`, I take all the files with that string, don't I? – Marta_ma Sep 05 '19 at 08:40
  • @Marta_ma It's the same pattern you used in your question. If that's not what matches your files, change it to something appropriate. Just make sure that your destination file doesn't match the pattern. – Shawn Sep 05 '19 at 08:43
  • What I wanted to say is that, since the wildcard already picks up all the files that I want, I don't need a for loop – Marta_ma Sep 05 '19 at 08:46
  • @Marta_ma If you want to remove the first line of every file you do. *All the files start with a header that I don't want in the final file.* If that's not the case, edit your question to clarify. – Shawn Sep 05 '19 at 08:47
  • You are right, I do want that, but it's a minor problem, because I can do it later with `sed` or another command – Marta_ma Sep 05 '19 at 08:52