I have a table snp150Common.txt, where the second and third fields $2 and $3 can be equal or not.

If they are equal, I want $2 to become $2-1, so that:

chr1    10177   10177   rs367896724 -   -   -/C insertion   near-gene-5
chr1    10352   10352   rs555500075 -   -   -/A insertion   near-gene-5
chr1    11007   11008   rs575272151 C   C   C/G single      near-gene-5
chr1    11011   11012   rs544419019 C   C   C/G single      near-gene-5
chr1    13109   13110   rs540538026 G   G   A/G single      intron
chr1    13115   13116   rs62635286  T   T   G/T single      intron
chr1    13117   13118   rs62028691  A   A   C/T single      intron
chr1    13272   13273   rs531730856 G   G   C/G single      ncRNA
chr1    14463   14464   rs546169444 A   A   A/T single      near-gene-3,ncRNA

becomes:

chr1    10176   10177   rs367896724 -   -   -/C insertion   near-gene-5
chr1    10351   10352   rs555500075 -   -   -/A insertion   near-gene-5
chr1    11007   11008   rs575272151 C   C   C/G single      near-gene-5
chr1    11011   11012   rs544419019 C   C   C/G single      near-gene-5
chr1    13109   13110   rs540538026 G   G   A/G single      intron
chr1    13115   13116   rs62635286  T   T   G/T single      intron
chr1    13117   13118   rs62028691  A   A   C/T single      intron
chr1    13272   13273   rs531730856 G   G   C/G single      ncRNA
chr1    14463   14464   rs546169444 A   A   A/T single      near-gene-3,ncRNA

My current command, adapted from https://askubuntu.com/a/312843:

zcat < snp150/snp150Common.txt.gz | head | awk '{ if ($2 == $3) $2=$2-1; print $0 }' | cut -f 2,3,4,5,8,9,10,12,16

returns the rows unchanged:

chr1    10177   10177   rs367896724 -   -   -/C insertion   near-gene-5
chr1    10352   10352   rs555500075 -   -   -/A insertion   near-gene-5
chr1    11007   11008   rs575272151 C   C   C/G single      near-gene-5
chr1    11011   11012   rs544419019 C   C   C/G single      near-gene-5
chr1    13109   13110   rs540538026 G   G   A/G single      intron
chr1    13115   13116   rs62635286  T   T   G/T single      intron
chr1    13117   13118   rs62028691  A   A   C/T single      intron
chr1    13272   13273   rs531730856 G   G   C/G single      ncRNA
chr1    14463   14464   rs546169444 A   A   A/T single      near-gene-3,ncRNA

Any help is greatly appreciated.

Carmen Sandoval
  • Your `cut` (which is pretty foo as you are already using awk) indicates that the `snp150Common.txt` has more columns than you show above. Are you sure `$2` and `$3` are the columns you really want to compare? – James Brown May 07 '18 at 03:31
  • And *that* is why I suck at programming. Thank you! – Carmen Sandoval May 07 '18 at 03:34

1 Answer

This answer is based on pure speculation about the source file format:

$ zcat snp150/snp150Common.txt.gz |
  awk '
  BEGIN { FS=OFS="\t" }                    # assumed: fields are tab-separated, as cut expects
  {
      if ($3 == $4)                        # cut shifts the columns by one, so compare these
          $3=$3-1
      print $2,$3,$4,$5,$8,$9,$10,$12,$16  # ... and print the fields that cut selected
  }
  NR==10 { exit }'                         # this replaces head
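
To check the logic without the full download, here is a minimal sketch that runs the same comparison on two made-up rows. The rows are hypothetical: they carry only five tab-separated columns (a leading bin column, then chrom, start, end, name), so the printed field list is shorter than in the command above, and the bin value 585 is just a placeholder. The helper name `fix_starts` is also an invention for this example.

```shell
# Hypothetical helper: same comparison as above, on a reduced 5-column layout.
# Assumes the raw rows start with an extra bin column, as the cut field list suggests.
fix_starts() {
    awk 'BEGIN { FS = OFS = "\t" }      # split and join on tabs, like cut does
         $3 == $4 { $3 = $3 - 1 }       # equal start and end: shift start down by 1
         { print $2, $3, $4, $5 }'      # drop the leading bin column
}

# Two made-up rows: one with equal start/end, one already differing by 1.
result=$(printf '585\tchr1\t10177\t10177\trs367896724\n585\tchr1\t11007\t11008\trs575272151\n' | fix_starts)
printf '%s\n' "$result"
```

The first row should come out with a start of 10176 and the second should pass through untouched, which is the behaviour the question asks for.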

And remember: Practising (anything but sucking) makes you suck less.

James Brown