
I have an 81G gzip file; uncompressed it is 254G. I want to write a bash script that takes the gzip file and splits it based on the first column. The first column has values ranging from 1 to 10, so I want to split the file into 10 subfiles: all rows whose first column is 1 go into one file, all rows whose first column is 2 go into a second file, and so on. While doing that, I don't want column 3 and column 5 in the new subfiles. The file is tab separated. For example:

col_1    col_2    col_3    col_4    col_5    col_6
1        7464     sam      NY       0.738    28.9
1        81932    Dave     NW       0.163    91.9
2        162      Peter    SD       0.7293   673.1
3        7193     Ooni     GH       0.746    6391
3        6139     Jess     GHD      0.8364   81937
3        7291     Yeldish  HD       0.173    1973

The file above should result in three different gzipped files, with col_3 and col_5 removed from each of the new subfiles. What I did was:

#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 500G
#SBATCH --time 12:00:00
#SBATCH -c 1



awk -F, '{print > $1".csv.gz"}' file.csv.gz


But this is not producing the desired result. I also don't know how to remove col_3 and col_5 from the new subfiles. As I said, the gzip file is 81G, so I am looking for an efficient solution. Insights would be appreciated.

John
  • What's your field separator? Multiple spaces, one tab or one comma? – Cyrus Dec 19 '20 at 22:11
  • @Cyrus It says "tab separated". – Benjamin W. Dec 19 '20 at 22:11
  • Is the header line real or for illustration? – Benjamin W. Dec 19 '20 at 22:13
  • It is always tab separated. – John Dec 19 '20 at 22:13
  • @BenjaminW. it is for illustration purposes. – John Dec 19 '20 at 22:14
  • I know you said you only have 1-10 values so this shouldn't apply, but just be aware that any variation of `awk '{print > $1".csv.gz"}'` or `print | ...` that doesn't call `close()` would fail with "too many open files" once you get past a dozen or so output files unless you're using GNU awk, and even with GNU awk it'd start to slow down for large numbers of simultaneously open output files as it has to work to manage that internally. Also, if your file is tab-separated then set the FS to tab using `-F'\t'`, not to comma using `-F,`. – Ed Morton Dec 20 '20 at 15:31
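
A minimal sketch of both points in that last comment, setting the field separator to tab and closing the pipes explicitly, might look like this with any awk (file.tsv.gz and the output names are assumptions, and it only stays cheap because column 1 has about ten distinct values; it does not drop columns 3 and 5, which the answers below handle):

gunzip -c file.tsv.gz |
awk -F'\t' '
    {
        out = "gzip > \047" $1 ".csv.gz\047"   # one gzip pipe per distinct value of $1
        print | out
        seen[out] = 1
    }
    END { for (o in seen) close(o) }           # flush and reap every gzip pipe
'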

3 Answers


You have to decompress and recompress; to get rid of columns 3 and 5, you could use GNU cut like this:

gunzip -c infile.gz \
    | cut --complement -f3,5 \
    | awk '{ print | "gzip > " $1 ".csv.gz" }'

Or you could get rid of the columns in awk:

gunzip -c infile.gz \
    | awk -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 ".csv.gz" }'
Benjamin W.
  • Thank you for your comment. I did what you suggested. However, when I try to upload a file produced I get the following error: **EOFError: Compressed file ended before the end-of-stream marker was reached.** – John Dec 20 '20 at 15:36
  • @John is it possible you ran out of disk space to save the output files so you ended up with part of a file written? – Ed Morton Dec 20 '20 at 15:45
  • I opened the first file it created. If it ran out of disk space it shouldn't have created the other files but it does create other files. – John Dec 20 '20 at 17:41
  • @John It creates each file the first time the corresponding value in column 1 is encountered. – Benjamin W. Dec 20 '20 at 17:49
  • So how can I proceed because I reserved 500G to process the gzip file. My original gzip file is 81G. That is why I thought of splitting it into smaller files based on first column. – John Dec 20 '20 at 17:58
  • @John it sounds like you're uploading the files to some other system so if space on your local machine really is the issue then you could generate/upload/remove one output file at a time by something like (pseudocode) `gunzip file | awk -v c='1.' '$1!=c{next} ...' | gzip > foo; upload foo; rm foo`. – Ed Morton Dec 20 '20 at 18:05
  • @EdMorton I am running everything on a cluster. So I should move the file created to another folder and remove from current folder? – John Dec 20 '20 at 18:11
  • The first question is - **are** you running out of disk space? No point trying to solve a disk space problem if you don't have one. – Ed Morton Dec 20 '20 at 18:12
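
If disk space really is the limit, a concrete version of the one-file-at-a-time approach Ed Morton sketches in the comments above might look like the following; infile.gz, the value range 1-10, and upload_cmd are assumptions standing in for the real file, key range, and transfer step:

#!/bin/bash
# Sketch: build, ship, and delete one subfile at a time to cap disk usage.
for v in 1 2 3 4 5 6 7 8 9 10; do
    gunzip -c infile.gz \
        | awk -v c="$v" 'BEGIN { FS = OFS = "\t" } $1 == c { print $1, $2, $4, $6 }' \
        | gzip > "${v}.csv.gz"
    upload_cmd "${v}.csv.gz"   # hypothetical upload/move step
    rm -f "${v}.csv.gz"
done

The trade-off is that the 81G archive is decompressed once per value, so this spends extra CPU time in exchange for keeping at most one subfile on disk at any moment.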

Something like

zcat input.csv.gz | cut -f1,2,4,6- | awk '{ print | ("gzip -c > " $1 ".csv.gz") }'

Uncompress the file, remove fields 3 and 5, save to the appropriate compressed file based on the first column.

Shawn
  • Though personally I'd use zstandard over gzip for any new compressed files. – Shawn Dec 19 '20 at 22:30
  • Won't that overwrite existing output files, such that each n.csv.gz ends up containing the data from just one input line? – John Bollinger Dec 20 '20 at 15:20
  • @John no, the pipe to the first call to `gzip` will stay open until awk terminates (or `close()` is called on it) so `gzip` is only getting called once per unique `$1`, not once per input line. – Ed Morton Dec 20 '20 at 15:26
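
A quick way to see that behaviour in isolation (the input and the .out file names here are made up for the demonstration):

printf '1\tx\n2\ty\n1\tz\n' |
awk '{ print | ("cat > " $1 ".out") }'
# 1.out ends up with both "1" lines and 2.out with the single "2" line,
# because awk starts one cat per distinct command string and keeps that
# pipe open until close() is called on it or awk exits.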

Robustly and portably with any awk, if the file is always sorted by the first field as shown in your example:

gunzip -c infile.gz |
awk '
    BEGIN { FS = OFS = "\t" }
    { $0 = $1 OFS $2 OFS $4 OFS $6 }
    NR==1 { hdr = $0; next }
    $1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
    !seen[$1]++ { print hdr | gzip }
    { print | gzip }
'

otherwise:

gunzip -c infile.gz |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3- |
awk '
    BEGIN { FS = OFS = "\t" }
    { $0 = $1 OFS $2 OFS $4 OFS $6 }
    NR==1 { hdr = $0; next }
    $1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
    !seen[$1]++ { print hdr | gzip }
    { print | gzip }
'

The first awk adds two fields at the front of every line: a flag that makes the header line sort before the rest, and the line number so that lines sharing the same original first-field value keep their input order. We then sort on the flag, the original first field, and the line number, and cut away the two fields added in the first step. The final awk creates the separate output files robustly and portably, making sure each one starts with a copy of the header. We close each output pipe as we go, so the script works for any number of output files with any awk and stays efficient even for a large number of output files with GNU awk. Each output file name is also quoted (the \047s are single quotes) inside the gzip command, so the shell doesn't perform globbing, word splitting, or filename expansion on it.
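
For a concrete picture of the decorate/sort/undecorate steps, here is the middle of that second pipeline run on its own; sample.tsv is an assumed uncompressed copy of the six-line example from the question:

awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' sample.tsv |
sort -k1,1n -k3,3 -k2,2n |
cut -f3-
# Output: the original lines with the header first and the data rows
# grouped by the value in column 1, each group in its original input order.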

Ed Morton