2

In a file that has a particular column information I want to remove exactly 5 fields (i.e :PG:PB:PI:PW:PC (separator is ':') from the end of the lines, not from the beginning.

GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC

Assuming that the above data is from column #3 of the file, I wrote the following code:

awk 'BEGIN{FS=OFS="\t"} { split($3, a,":")} {print ($1, $2, a[1]":"a[2]":"a[3]":"a[4]":"a[5])}' awk_test.vcf

This code splits and selects the first 5 fields, but I want to remove the last 5 fields. Selecting from the first fields won't work since certain fields like PGT , PID are inserted in certain lines. Only, removing from the end works.

Expected output:

GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PGT:PID:PL
GT:AD:DP:GQ:PGT:PID:PL

Thanks for helping me with the code for first part of my question.

But, the script isn't working for my another file which has the following data. Here I want to update the 9th column with the same purpose. The columns are tab separated. But, what I want to do remains basically the same.

2   1463    .   T   TG  433.67  PASS    AC=0;AF=0.00;AN=0;BaseQRankSum=-4.310e-01;ClippingRankSum=0.00;DP=247;ExcessHet=2.9800;FS=0.000;MQ=21.25;MQRankSum=0.00;QD=33.36;ReadPosRankSum=-6.740e-01;SOR=0.784;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC  ./.:76,0:76:0,0,0:./.:.:.:./.:. ./.:55,0:55:0,0,0:.:.:.:.:. ./.:68,0:68:0,0,0:.:.:.:.:. ./.:48,0:48:0,0,0:.:.:.:.:.
2   1466    .   TG  T   395.82  PASS    AC=0;AF=0.00;AN=0;BaseQRankSum=1.01;ClippingRankSum=0.00;DP=287;ExcessHet=5.1188;FS=7.707;MQ=18.00;MQRankSum=0.00;QD=17.21;ReadPosRankSum=1.28;SOR=0.074;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC  ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2   1467    .   G   T   1334.42 PASS    AC=0;AF=0.00;AN=0;BaseQRankSum=0.674;ClippingRankSum=0.00;DP=287;ExcessHet=4.8226;FS=1.328;MQ=23.36;MQRankSum=0.00;QD=28.65;ReadPosRankSum=-4.310e-01;SOR=0.566;set=SNPs    GT:AD:DP:PL:PG:PB:PI:PW:PC  ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2   1516    .   C   T   5902.93 PASS    AC=2;AF=0.250;AN=8;BaseQRankSum=0.287;ClippingRankSum=0.00;DP=411;ExcessHet=0.5065;FS=1.489;InbreedingCoeff=0.3492;MQ=59.77;MQRankSum=0.00;QD=28.38;ReadPosRankSum=-7.100e-02;SOR=0.553;set=SNPs    GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:.   1/1:1,108:109:99:3935,286,0:.:.:.:.:.   0/0:102,0:102:99:0,120,1800:.:.:.:.:.   0/0:78,0:78:99:0,120,1800:.:.:.:.:.
2   1584    .   CT  C   164.08  PASS    AC=0;AF=0.00;AN=8;DP=717;ExcessHet=0.0812;FS=0.000;InbreedingCoeff=0.9386;MQ=60.00;QD=32.82;SOR=3.611;set=InDels    GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:.   0/0:172,0:172:99:0,120,1800:.:.:.:.:.   0/0:102,0:102:99:0,120,1800:.:.:.:.:.   0/0:321,0:321:99:0,120,1800:.:.:.:.:.
2   1609    .   C   A   604.68  PASS    AC=0;AF=0.00;AN=0;DP=386;ExcessHet=0.1158;FS=0.000;InbreedingCoeff=0.8938;MQ=12.32;QD=31.09;SOR=1.061;set=SNPs  GT:AD:DP:PL:PG:PB:PI:PW:PC  ./.:0,0:0:0,0,0:./.:.:.:./.:.   ./.:0,0:0:0,0,0:.:.:.:.:.   ./.:0,0:0:0,0,0:.:.:.:.:.   ./.:386,0:386:0,0,0:.:.:.:.:.
2   1612    .   TGTGAGCTATTTCTTTTACATTTTTCTTTAGATTCTAGGTTAAATTGTGAAGCTGATTATCTTTTTTGTTTACAG T   1298.76 PASS    AC=2;AF=1.00;AN=2;DP=3;ExcessHet=0.1047;FS=0.000;InbreedingCoeff=0.8896;MQ=60.02;QD=29.54;SOR=1.179;set=InDels  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   ./.:0,0:0:.:0,0,0:./.:.:.:./.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. 1/1:0,3:3:99:1355,582,0:.:.:.:.:.
2   1657    .   T   A,* 3118.91 PASS    AC=0,2;AF=0.00,1.00;AN=2;BaseQRankSum=0.578;ClippingRankSum=0.00;DP=4;ExcessHet=1.9114;FS=3.474;InbreedingCoeff=0.0821;MQ=26.68;MQRankSum=0.841;QD=28.10;ReadPosRankSum=-5.960e-01;SOR=0.821;set=SNPs   GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   ./.:0,0,0:0:.:0,0,0,0,0,0:./.:.:.:./.:. ./.:1,0,0:1:.:0,0,0,0,0,0:.:.:.:.:. ./.:0,0,0:0:.:0,0,0,0,0,0:.:.:.:.:. 2/2:0,0,3:3:99:1355,1360,1393,582,615,0:.:.:.:.:.
2   1738    .   A   G   4693.24 PASS    AC=2;AF=0.250;AN=8;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=1595;ExcessHet=0.0577;FS=0.621;InbreedingCoeff=0.6496;MQ=60.00;MQRankSum=0.00;QD=5.46;ReadPosRankSum=0.307;SOR=0.773;set=SNPs  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/1:389,92:481:99:1748,0,12243:0|1:.,.,.,.,.:935:|:0.5  0/0:318,0:318:99:0,120,1800:.:.:.:.:.   0/1:270,53:323:99:990,0,9096:.:.:.:.:.  0/0:473,0:473:99:0,120,1800:.:.:.:.:.
2   2781    .   T   G   435.07  PASS    AC=1;AF=0.125;AN=8;BaseQRankSum=0.624;ClippingRankSum=0.00;DP=2146;ExcessHet=3.4523;FS=8.450;InbreedingCoeff=-0.0856;MQ=60.06;MQRankSum=-4.630e-01;QD=1.27;ReadPosRankSum=-5.980e+00;SOR=1.436;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC   0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:.   0/1:309,34:343:99:0|1:2781_T_G:469,0,12941:.:.:.:.:.    0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:.   0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2   2786    .   C   G   39.69   PASS    AC=0;AF=0.00;AN=8;BaseQRankSum=0.881;ClippingRankSum=0.00;DP=2145;ExcessHet=4.3933;FS=0.000;InbreedingCoeff=-0.1367;MQ=52.41;MQRankSum=-1.356e+00;QD=1.13;ReadPosRankSum=0.577;SOR=0.527;set=SNPs   GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/0:620,0:620:99:0,120,1800:0/0:.:.:0/0:.   0/0:342,0:342:99:0,120,1800:.:.:.:.:.   0/0:492,0:492:99:0,120,1800:.:.:.:.:.   0/0:691,0:691:99:0,120,1800:.:.:.:.:.
2   2787    .   T   C   993.78  PASS    AC=1;AF=0.125;AN=8;BaseQRankSum=-2.967e+00;ClippingRankSum=0.00;DP=2153;ExcessHet=3.8663;FS=4.941;InbreedingCoeff=-0.1076;MQ=60.06;MQRankSum=-5.100e-01;QD=2.84;ReadPosRankSum=-3.689e+00;SOR=0.875;set=SNPs    GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC   0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:.   0/1:309,41:350:99:0|1:2781_T_G:1027,0,13619:.:.:.:.:.   0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:.   0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2   2792    .   A   G   745.21  PASS    AC=1;AF=0.125;AN=8;BaseQRankSum=0.271;ClippingRankSum=0.00;DP=2176;ExcessHet=5.9256;FS=5.964;InbreedingCoeff=-0.2087;MQ=59.48;MQRankSum=-4.920e-01;QD=1.83;ReadPosRankSum=-3.100e-02;SOR=1.389;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC   0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:.   0/1:332,41:373:99:0|1:2781_T_G:705,0,13295:.:.:.:.:.    0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:.   0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.

I also tried adding FS/OFS parameters but it isn't working.

everestial007
  • 6,665
  • 7
  • 32
  • 72
  • What does this have to do with Python? Or sed? Or select? – Scott Hunter Dec 30 '16 at 20:33
  • I just added the tag so it reaches to larger number of people. Also, this is something I want to pipe into my python script. So – everestial007 Dec 30 '16 at 20:34
  • 1
    Adding unrelated tags may get more people looking; it may have the opposite affect on how many answers you get. – Scott Hunter Dec 30 '16 at 20:36
  • 1
    Assumming that "GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC" is the content of field `$3`, then `l=split(...)` returns the number of fields in `a`, so you can loop `for(i=1;i<=l;i++){ work with a[i] }` and build a string from `a[1] , ..., a[l-5]` similar to the answer given by Jean below. – Lars Fischer Dec 30 '16 at 21:13

2 Answers2

4

After some clarification what the file looks like, here is my updated answer:

You can simply use

awk 'BEGIN{FS=OFS="\t"} {$9 = gensub(/(:[^:]+){5}$/,"","1",$9)} 1' yourfile
  • Here we use the standard awk field splitting, since your file is tab-separated.
  • We further do a regular expression replacement scoped to $9, which is the colon-separated string you want to change.
  • The regular expression works the same as in the old answer, in which I had the impression that the line consists only of the colon-separated string.

Old Answer Since you wrote "pipe to python" in your comment, maybe you are open to an sed solution?

sed -r "s/(:[^:]+){5}$//" yourfile

Here we replace (s/...// replace the ... with nothing), the ... means:

  • from the end of line ($)
  • five ({5})
  • occurences of colon (:)
  • followed by something (+)
  • not a colon ([^:])

And this can again be "translated" to awk:

awk -F: 'BEGIN{FS=OFS="\t"} {$0 = gensub(/(:[^:]+){5}$/,"","1")} 1' yourfile
everestial007
  • 6,665
  • 7
  • 32
  • 72
Lars Fischer
  • 9,135
  • 3
  • 26
  • 35
  • I think more details would help: like how to read and manipulate only the 3rd column, but print the other columns too without any changes. – everestial007 Dec 30 '16 at 20:56
  • @everestial007 If you want to operate on certain fields/columns, then awk is **the right** tool. – Lars Fischer Dec 30 '16 at 21:01
  • 1
    @everestial007 After the edit of the question: you can use this awk command: `awk '{$9 = gensub(/(:[^:]+){5}$/,"","1",$9)} 1' yourfile` to scope the modification to field `$9`. – Lars Fischer Dec 30 '16 at 21:26
  • All you need is `awk '{sub(/(:[^:]+){5}$/,"",$9)} 1' yourfile`. No reason to make it gawk-specific and more complicated with `gensub()`. – Ed Morton Dec 30 '16 at 22:38
  • @LarsFischer: Thanks for the answer. I had to add `BEGIN{FS=OFS="\t"}` to get what I exactly need. Thanks much for the help. – everestial007 Dec 31 '16 at 16:37
3

Maybe not the best awk solution but works:

awk -F: '{printf($1); for (i=2;i<=NF-5;i++) printf(":%s",$i); printf("\n"); }' file.txt
  • split the fields naturally according to colon
  • print first field, and then other fields minus the 5 last ones (using NF: number of fields preset variable), with leading colon.
  • print a linefeed to end the line.

EDIT: I knew there was better to do using awk. As Lars commented, this is way simpler and cleaner:

awk -F: '{s= $1; for(i = 2; i<= NF-5;i++) s= s FS $i; print s}'
  • use separator value instead of hardcoded colon
  • compose string instead of printing all fields
  • print string in the end

If you want to use it within a python script, I'd suggest that you write that in python, simpler & faster:

import csv

with open("file.txt") as fr, open("out.txt","w",newline="") as fw:
    cr = csv.reader(fr,delimiter=":")
    cw = csv.writer(fw,delimiter=":")
    for row in cr:
        cw.writerow(row[:-5])  # write the row but the 5 last fields

you can omit the with part if you already have open handles.

EDIT: since you heavily edited your question after my answer, now you want to remove the 5 last "fields" from one particular field (tab-separated). Lars has answered properly awk-wise, let me propose my python solution:

import csv

with open("file.txt") as fr, open("out.txt","w",newline="") as fw:
    cr = csv.reader(fr,delimiter="\t")
    cw = csv.writer(fw,delimiter="\t")
    for row in cr:
        row[8]=":".join(row[8].split(":")[:-5])  # remove 5 last "fields" from 8th field
        cw.writerow(row)  # write the modified row
Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
  • Using print and FS you can simplify a bit: `awk -F: '{s= $1; for(i = 2; i<= NF-5;i++) s= s FS $i; print s}' file` – Lars Fischer Dec 30 '16 at 20:43
  • @LarsFischer I NEVER succeed in writing a simple awk script :) thx edited (will I ever learn from awk experts?) – Jean-François Fabre Dec 30 '16 at 20:46
  • Its not actually working. The ouput is like: `GT:PW:PC`. It removing the field from between the beginning and end of the columns. – everestial007 Dec 30 '16 at 20:50
  • works fine here. Make sure that your file doesn't have \r / carriage return chars in it. Perform a dos2unix on it to be sure. – Jean-François Fabre Dec 30 '16 at 20:51
  • Its looks like the another script `awk -F: '{s= $1; for(i = 2; i<= NF-5;i++) s= s FS $i; print s}' file` is affecting other columns too. I want the removal to be specific to `3rd` column only. I tried changing the value of `s` but the results are not looking what I expected. – everestial007 Dec 30 '16 at 20:54
  • I really don't understand your issue now. Your question was clear, though. – Jean-François Fabre Dec 30 '16 at 20:57
  • Hi@Jean-FrançoisFabre: I am also confused over why the script worked over the data where I only had 3 columns. But, when I wanted to do the similar update with another big data it failed. I have updated the question details with more info on how my large file actually looks. But, what I am trying to do with a `mock.file` vs. `actual.file` are the samething. I tried to check `carriages` within my big data but there are none. Please have a look. – everestial007 Dec 30 '16 at 21:18
  • 1
    I just read your edit, now I understand your full problem. I proposed a python solution that does it, if you're interested. – Jean-François Fabre Dec 30 '16 at 21:35
  • Thanks @Jean-FrançoisFabre: The python solution is good. I will see how efficient `awk vs python` script becomes in terms of memory and time. But, please make sure to leave all the solutions here. I am sure it will be helpful for me in the future if not for someone else. I also added the `python` tag to the question if someone is need of similar solution. Thanks ! – everestial007 Dec 31 '16 at 16:26
  • I won't remove any of the text above, don't worry. – Jean-François Fabre Dec 31 '16 at 16:42