create new column containing a substring of an existing column in data using bash

Question

I have a large tsv.gz file (40GB). I want to extract a string from an existing variable col3, store it in a new variable New_var (placed at the beginning), and save everything to a new file. An example of the data, "old_file.tsv.gz":

col1  col2  col3  col4
1  positive  12:1234A  100
2  negative  10:9638B  110
3  positive  5:0987A  100
4  positive  8:5678A  170

Desired data "new_file.tsv.gz"

New_var  col1  col2  col3  col4
12  1  positive  12:1234A  100
10  2  negative  10:9638B  110
5  3  positive  5:0987A  100
8  4  positive  8:5678A  170

I am new to bash, so I have tried multiple things but I get stuck. I have tried:

zcat old_file.tsv.gz | awk '{print New_var=$3,$0 }' | awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }'

I think I have multiple problems. {print New_var=$3,$0 } does create a duplicate of col3 but doesn't rename it. Then, when I add the last part of the code, awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }', nothing comes out (I checked whether I forgot a parenthesis but cannot find the problem). Also, I am not sure this is the best way to do it. Any idea how to make it work?

There is no question in your question. (Hint: questions end with a question mark.) Also, what have you tried? Where did you run into a problem? – Mark Adler, May 16 '22 at 21:25
Sorry, I have tried so many things that it is difficult to say where I am at. I updated my post, I hope it helps. My question comes back to "how do I make this happen?" – RCchelsie, May 16 '22 at 21:55
Why do you have a backtick before `$1`? Why is your `gzip` inside your awk script? – jhnc, May 16 '22 at 22:19
@jhnc the backtick was a mistake and is corrected. Thanks for catching it! The gzip comes from a solution I found here https://stackoverflow.com/questions/65374943/split-a-large-gz-file-into-smaller-ones-filtering-and-distributing-content and used previously. – RCchelsie, May 16 '22 at 22:34

score 1 · Accepted Answer · answered May 16 '22 at 22:54

Make an AWK script in a separate file (for readability), say 1.awk:

{ if (NR > 1) {
    # all data lines: a[1] is the part of col3 before the colon
    split($3, a, ":");
    print a[1], $1, $2, $3, $4;
  } else {
    # header line
    print "New_var", $1, $2, $3, $4;
  }
}

Now process the input (say 1.csv.gz) with the AWK file:

zcat 1.csv.gz | awk -f 1.awk | gzip -c > 1_new.csv.gz

Michail Alexakis


It works!! Thank you very much! I didn't know about the split function. – RCchelsie May 17 '22 at 00:45
Yes, `split` is quite useful if you want to process a field. By the way, there are a lot of [string functions](https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html) available in AWK. – Michail Alexakis May 17 '22 at 06:43
Thanks, that is really useful! I use R but very little bash. Question: I realize that with this code, the data is not parsed correctly. I think it is because I have to use `awk 'BEGIN{FS=OFS="\t"}` somewhere but can't figure out where. Do you have an idea? – RCchelsie May 17 '22 at 17:04
You can use the `-v` flag of AWK to set variables at startup (even before the BEGIN block). For example, to set the input/output field separators, you can do: `awk -v FS='\t' -v OFS='\t' -f 1.awk ...` – Michail Alexakis May 17 '22 at 17:32
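
Putting the last comment together with the answer, a minimal end-to-end sketch might look like the following. The sample rows (including the space inside col2), the temporary directory, and the file name add_col.awk are made up for illustration, and the script is a condensed variant of 1.awk rather than the answerer's exact file. `gzip -dc` is used in place of `zcat` only because it behaves the same way on more systems.

```shell
# Throwaway directory so no real file is overwritten.
workdir=$(mktemp -d)

# Tab-separated sample shaped like the question's data; col2 contains a space on purpose.
printf 'col1\tcol2\tcol3\tcol4\n1\tweakly positive\t12:1234A\t100\n2\tnegative\t10:9638B\t110\n' \
  | gzip -c > "$workdir/old_file.tsv.gz"

# Condensed variant of 1.awk: prepend the new column and keep the original line intact.
cat > "$workdir/add_col.awk" <<'EOF'
NR == 1 { print "New_var", $0; next }   # header line
{ split($3, a, ":"); print a[1], $0 }   # data lines: text before ":" in col3
EOF

# -v assigns FS and OFS before any input is read, so only tabs separate fields.
gzip -dc "$workdir/old_file.tsv.gz" \
  | awk -v FS='\t' -v OFS='\t' -f "$workdir/add_col.awk" \
  | gzip -c > "$workdir/new_file.tsv.gz"

# Inspect the result.
gzip -dc "$workdir/new_file.tsv.gz"
```

With the tab set as the only separator, a value such as "weakly positive" stays in one column instead of being split at the space.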

score 1 · Answer 2 · answered May 16 '22 at 22:54

I suggest using a tab (\t) and a colon (:) as input field separators:

awk 'BEGIN { FS="[\t:]"; OFS="\t" }
     NR==1 { $1="New_var" OFS $1 }
     NR>1  { $0=$3 OFS $0 }
     { print }'

As one line:

awk 'BEGIN{ FS="[\t:]"; OFS="\t" } NR==1{ $1="New_var" OFS $1 } NR>1{ $0=$3 OFS $0 } { print }'

See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
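
To try the command above without touching the real file, it can be fed a few rows from `printf`. This is only a sketch: the rows are taken from the question's sample, and the variable name `result` is introduced here for illustration.

```shell
# Run the answer's awk program on two sample rows plus the header.
result=$(
  printf 'col1\tcol2\tcol3\tcol4\n1\tpositive\t12:1234A\t100\n3\tpositive\t5:0987A\t100\n' |
  awk 'BEGIN { FS="[\t:]"; OFS="\t" }
       NR==1 { $1="New_var" OFS $1 }
       NR>1  { $0=$3 OFS $0 }
       { print }'
)

# Show the transformed rows.
printf '%s\n' "$result"
```

Because the colon is a field separator here, $3 holds only the text before the colon, while $0 still contains the untouched line, so col3 is printed in full.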

Cyrus


Thanks for the help and for the link! It works well! Unfortunately, I cannot use it for my larger data because I have the character ":" in other columns that I don't want to be parsed. – RCchelsie May 17 '22 at 17:07
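
One possible way around the limitation described in this comment is to split on tabs only and strip everything from the first colon onward in a copy of col3. This is a sketch, not part of the original answer: the sample row with "ratio 3:1" in col2 and the names `prefix` and `result` are assumptions made for illustration.

```shell
# Split on tabs only, so colons in other columns are left alone.
result=$(
  printf 'col1\tcol2\tcol3\tcol4\n1\tratio 3:1\t12:1234A\t100\n2\tnegative\t10:9638B\t110\n' |
  awk 'BEGIN { FS = OFS = "\t" }
       NR == 1 { print "New_var", $0; next }   # header line
       {
         prefix = $3                # work on a copy so col3 itself is unchanged
         sub(/:.*/, "", prefix)     # drop the first colon and everything after it
         print prefix, $0
       }'
)

# Show the transformed rows.
printf '%s\n' "$result"
```

Only col3 is inspected for the colon, so a value like "ratio 3:1" in another column passes through unchanged.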
