AWK to handle bed files

Question

I would like to extract and split fields from BED files to generate a new BED file with the rearranged data.

I would like to go from this:

1   15903   rs557514207 G   G,A RS=557514207;RSPOS=15903;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000005150026000200;GENEINFO=WASH7P:653635;WGT=1;VC=DIV;ASP;VLD;G5;KGPhase3;CAF=0.5589,.,0.4411;COMMON=1;TOPMED=0.30307084607543323,0.00039022680937818,0.69653892711518858
1   11012   rs544419019 C   G   RS=544419019;RSPOS=11012;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000020005150024000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP;VLD;G5;KGPhase3;CAF=0.9119,0.08806;COMMON=1
1   15903   rs557514207 G   G,C RS=557514207;RSPOS=15903;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000005150026000200;GENEINFO=WASH7P:653635;WGT=1;VC=DIV;ASP;VLD;G5;KGPhase3;CAF=0.5589,.,0.4411;COMMON=1;TOPMED=0.30307084607543323,0.00039022680937818,0.69653892711518858

To this:

1   15903   rs557514207 G   G   CAF=0.5589,.
1   15903   rs557514207 G   A   CAF=0.5589,0.4411
1   11012   rs544419019 C   G   CAF=0.9119,0.08806
1   15903   rs557514207 G   G   CAF=0.5589,.
1   15903   rs557514207 G   C   CAF=0.5589,0.4411

So column 5 should be split on commas, with each allele written to its own line, and column 6 should be reduced to its CAF= entry, keeping on each new line the values that correspond to that allele. Column 6 contains a single string of entries joined by semicolons. I'm interested in the part ;CAF=value1,value2; between the semicolons. In this example that gives two new lines, CAF=value1 and CAF=value2, which correspond to the split of G,A into two new lines for G and A.

Please [edit] your question to show your own attempt at solving the problem, and reformat your sample input/output as code blocks (see https://stackoverflow.com/help/formatting), same as your code. – Ed Morton, May 16 '23 at 11:41
How many items can be in column 5? More than 2? Less than 1? What is the rule for splitting the CAF item? – jhnc, May 16 '23 at 12:11
The values are stored after CAF= and should be preserved in the new rows. – DaN, May 16 '23 at 13:24
From what you have described so far, it's unknown (without guessing...) how the column with the `CAF=` data should be created. – Luuk, May 16 '23 at 13:39
The 6th column contains text, separated by semicolons, all together in one string. I'm interested in the part ;CAF=value1,value2; between the semicolons, and the rest can be removed. In this example that results in two new lines, CAF=value1 and CAF=value2. – DaN, May 16 '23 at 13:43
In your example output, `CAF=` is shown only once. From your last [comment](https://stackoverflow.com/questions/76261041/awk-to-handle-bed-files#comment134487380_76261041), it is unclear where/how to get value1 and value2 (and comparing the example output with the given input does not make it clear either...). – Luuk, May 16 '23 at 14:03
@luuk, the last field of the second record contains `...;CAF=0.9119,0.08806;...` – Fravadona, May 16 '23 at 15:16
@Fravadona: yes? And the first line has `...5;KGPhase3;CAF=0.5589,.,0.4411;COMMON=1;TOP...`, with **3** comma-separated values after `CAF=`. – Luuk, May 16 '23 at 15:46

jhnc · Accepted Answer · 2023-05-16T12:58:05.717

awk -F'\t' -v OFS='\t' '
  {
    # split column 6; CAF part starts from element 2
    split($6, c6, /^.*CAF=|,|;.*$/)

    # split column 5
    n=split($5, c5, /,/)

    # print initial columns and relevant parts of 5 and 6
    for (i=1; i<=n; i++)
      print $1,$2,$3,$4, c5[i], "CAF="c6[2]","c6[2+i]
  }
' infile >outfile
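
To see how the two `split` calls line up, the same logic can be run on a shortened copy of the question's sample records. This is only a sketch under stated assumptions: the column 6 strings are cut down to a few entries, and the file names `sample.bed` and `result.bed` are made up for illustration.

```shell
# Build a tab-separated sample with a shortened column 6 (hypothetical file name).
printf '1\t15903\trs557514207\tG\tG,A\tRS=557514207;KGPhase3;CAF=0.5589,.,0.4411;COMMON=1\n' > sample.bed
printf '1\t11012\trs544419019\tC\tG\tRS=544419019;KGPhase3;CAF=0.9119,0.08806;COMMON=1\n' >> sample.bed

awk -F'\t' -v OFS='\t' '
  {
    # everything up to CAF= is one separator, so c6[1] is empty
    # and c6[2] holds the first CAF value
    split($6, c6, /^.*CAF=|,|;.*$/)

    # one output line per allele listed in column 5
    n = split($5, c5, /,/)

    # allele i is paired with the first CAF value and CAF value 1+i
    for (i = 1; i <= n; i++)
      print $1, $2, $3, $4, c5[i], "CAF=" c6[2] "," c6[2+i]
  }
' sample.bed > result.bed

cat result.bed
```

The first record has two alleles and three CAF values, so it produces two lines ending in `CAF=0.5589,.` and `CAF=0.5589,0.4411`; the second record produces one line ending in `CAF=0.9119,0.08806`.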

edited May 16 '23 at 12:58

answered May 16 '23 at 12:27

jhnc


In the output the CAF values are missing: 1 10177 rs367896724 A AC CAF=, – DaN May 16 '23 at 13:25
I checked again, but c6 seems to be empty. Maybe the error is in the split line: split($6, c6, /^.*CAF=|,|;.*$/). It splits off the CAF= but not the individual values after it, which are separated by , up to the ; – DaN May 17 '23 at 08:40
Your sample data is tab delimited. Is that how the real data is delimited? Does `awk -F'\t' 'NR==1{print $6}' infile` show the correct column? – jhnc May 17 '23 at 09:16
I used GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.1) under Ubuntu 22.04, and yes, the data is tab-delimited. – DaN May 17 '23 at 10:18
Indeed, I have another file where it was column 8 and not 6, so I have to take care of this and check the data again, but now it works as expected, thanks! – DaN May 17 '23 at 10:53
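
Since the string holding `CAF=` can sit in a different column from one file to the next, one option is to pass the column numbers in as awk variables instead of hard-coding `$5` and `$6`. The following is a minimal sketch, not a definitive fix: the variable names `alt` and `info`, the file names, and the 8-column layout with two `.` placeholder columns are all assumptions made for illustration.

```shell
# Hypothetical 8-column layout: the semicolon-separated string is in column 8.
printf '1\t11012\trs544419019\tC\tG\t.\t.\tRS=544419019;CAF=0.9119,0.08806;COMMON=1\n' > sample8.tsv

awk -F'\t' -v OFS='\t' -v alt=5 -v info=8 '
  # only handle lines whose chosen column really contains a CAF= entry
  $info ~ /CAF=/ {
    split($info, caf, /^.*CAF=|,|;.*$/)   # caf[2] is the first CAF value
    n = split($alt, a, /,/)               # alleles from the chosen column
    for (i = 1; i <= n; i++)
      print $1, $2, $3, $4, a[i], "CAF=" caf[2] "," caf[2+i]
  }
' sample8.tsv > result8.tsv

cat result8.tsv
```

With `-v info=6` the same command handles the layout shown in the question. The `$info ~ /CAF=/` guard skips lines where the chosen column has no `CAF=` entry, which avoids the empty `CAF=,` output reported above when the wrong column is selected.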
