I have a problem and would appreciate some help. I have a .txt file and use the awk code below to separate its columns with a tab delimiter. The columns come out aligned, but when a value is missing, the rest of the row shifts left and the values end up in the wrong columns. How could I handle this case in my code? Thank you so much.

#!/bin/bash

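for f in *.vcf; do
    awk 'BEGIN {OFS = "\t"}            # join output fields with tabs
        /^##/ {next}                   # skip VCF meta-information lines
        /^#/ {sub(/^#/,"",$1)}         # strip the leading # from the header line
        {$1=$1; print}                 # rebuild the record with OFS and print it
    ' "$f" > "${f/%vcf/tsv}"
done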


INPUT:

  CHROM    ID    REF   ALT
  chr1    235     A     B
  chr2     A      B
  chr3    225     B

DESIRED OUTPUT:

  CHROM    ID   REF   ALT  
  chr1    235    A     B 
  chr2     .     A     B 
  chr3    225    .     B
Vonton

1 Answer

The problem contains ambiguities. Looking at the data:

chr1    235     A     B 
chr2     A      B
chr3    225     B

In the chr2 row, we can perhaps guess that the ID column is missing because IDs are numbers: one column is missing, and neither of the remaining values is numeric, so the missing one must be ID.

But in the third row, how do we know that the REF column is missing, rather than ALT?

If ALT is never missing, then it's simple. But if either could be missing, it may be impossible.

In any case, before you can write the program code to renormalize the data into proper columns, you have to be able to articulate the rules for identifying what columns are missing, or else recognize that it is impossible and give up.
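
To make that concrete, here is a minimal sketch of what the code could look like once the rules are written down. It rests on assumptions that the question does not confirm: every row should have exactly four columns (CHROM, ID, REF, ALT), CHROM and ALT are never missing, an ID is always purely numeric, and REF is never purely numeric. The function name fix_columns is made up for this example, and the "." placeholder is taken from the desired output in the question. Rows that do not fit these rules are reported on standard error rather than guessed at.

```shell
#!/bin/bash

# Hypothetical helper: reads whitespace-separated rows on stdin and writes
# tab-separated rows with "." in place of a missing ID or REF.
# Assumed rules (not confirmed by the question):
#   - a complete row has 4 fields: CHROM ID REF ALT
#   - CHROM and ALT are never missing
#   - ID is purely numeric, REF is never purely numeric
fix_columns() {
    awk 'BEGIN { OFS = "\t" }
        # Complete row (this also covers the header): re-join with tabs.
        NF == 4 { $1 = $1; print; next }
        # Three fields, second one numeric: ID is present, so REF is missing.
        NF == 3 && $2 ~ /^[0-9]+$/ { print $1, $2, ".", $3; next }
        # Three fields, second one not numeric: ID is missing.
        NF == 3 { print $1, ".", $2, $3; next }
        # Anything else cannot be repaired under these rules.
        { print "cannot repair line " NR ": " $0 | "cat 1>&2" }
    '
}
```

It could be applied to the already converted file, for example fix_columns < sample.tsv > sample.fixed.tsv (both file names are placeholders). If ALT can also be missing, the second and third rules become ambiguous and this approach no longer works.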

You may simply have to go upstream and find a better source of the same data which doesn't have munged columns.

Kaz