I have a gzip file of size 81G which I unzip and size of uncompressed file is 254G. I want to implement a bash script which takes the gzip file and splits it on the basis of the first column. The first column has values range between 1-10. I want to split the files into 10 subfiles where by all rows where value in first column is 1 is put into 1 file. All the rows where the value is 2 in the first column is put into a second file and so on. While I do that I don't want to put column 3 and column 5 in the new subfiles. Also the file is tab separated. For example:
col_1 col_2. col_3. col_4. col_5. col_6
1. 7464 sam. NY. 0.738. 28.9
1. 81932. Dave. NW. 0.163. 91.9
2. 162. Peter. SD. 0.7293. 673.1
3. 7193. Ooni GH. 0.746. 6391
3. 6139. Jess. GHD. 0.8364. 81937
3. 7291. Yeldish HD. 0.173. 1973
File above will result in three different gzipped files such that col_3 and col_5 are removed from each of the new subfiles. What I did was
#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 500G
#SBATCH --time 12:00:00
#SBATCH -c 1
awk -F, '{print > $1".csv.gz"}' file.csv.gz
But this is not producing the desired result. Also I don't know how to remove col_3 and col_5 from the new subfiles. Like I said gzip file is 81G and therefore, I am looking for an efficient solution. Insights will be appreciated.