Printing a specific number of nucleotides

Question

I have VCF statistics for heterozygous and homozygous cases, and I would like to find matches with my MAF file. The issue is that the reference field in the MAF file is different: it excludes the nucleotides shared with the alternative allele. E.g., if you have a ref CAA and the alternative variant is CAAAAA, in the MAF file your ref would be AAA.

So I need code to change the ref and alt fields in my statistics file (maybe by adding separate columns ref2 and alt2).

Here is a snippet of my file:

CHR POS ID REF ALT
chr11 71579744 rs71049992 A ACAGCAGCTGGACTGGGAGCAGCAGGACCTG (insertion case)
chr11 124880551 rs71859853 CCGGAGT C (deletion case)

I think I should first count numbers of nucleotides in column4 and 5. Then, if the number in column 4 is greater than in column 5 (meaning a deletion), my ref2 will start from the first nucleotide after the part shared with the alternative allele.

For an insertion, the alt field is changed instead: the nucleotides it shares with ref are skipped.

As a result, I would like to have this:

CHR POS ID REF ALT REF2 ALT2
chr11 71579744 rs71049992 A ACAGCAGCTGGACTGGGAGCAGCAGGACCTG A CAGCAGCTGGACTGGGAGCAGCAGGACCTG
chr11 124880551 rs71859853 CCGGAGT C CGGAGT C

Thank you very much in advance!

You talk about a _maf file_, but the _snippet_ you show doesn't look like [MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/). — Armali, Aug 28 '21 at 20:50
Hi Armali, Yes, this is a snippet from vcf statistics. I want to find a match between this file and my maf — Anna, Sep 05 '21 at 12:54
The shown _snippet_ looks strange - it seems to start with column headers _CHR POS ID REF ALT_, but then data follows in the same line; is the _file_ really (un)structured like that? — Armali, Sep 05 '21 at 15:24
Sorry that should be rather two lines CHR POS ID REF ALT (new line) chr11 71579744 rs71049992 A ACAGCAGCTGGACTGGGAGCAGCAGGACCTG — Anna, Sep 05 '21 at 17:11
@Armali To give a better example. Here is a snippet of my maf file: AC215217.1 chr10 21198 21205 AAAAAAAA - 5'Flank DEL GRD01 AAA rs200438810 NA MODIFIER ENSG00000260370 upstream_gene_variant . 2256.16 21194 NA AC215217.1 (new line) chr10 21199 21205 AAAAAAA - 5'Flank DEL SAS05 AAAA rs200438810 NA MODIFIER ENSG00000260370 ENST00000566940 upstream_gene_variant . 2256.16 21194 NA — Anna, Sep 05 '21 at 17:18
And here is the statistics derived from the vcf file for the same position: chr10 21194 . CAAAAAAAAAAA CAAAA 3 0 14 chr10 21194 . CAAAAAAAAAAA CAAAAAAA 5 1 11 chr10 21194 . CAAAAAAAAAAA CAAA 3 3 11 chr10 21194 . CAAAAAAAAAAA CAAAAAA 3 2 12 chr10 21194 . CAAAAAAAAAAA CAAAAA 2 1 14 chr10 21194 . CAAAAAAAAAAA C 1 0 16 — Anna, Sep 05 '21 at 17:22
So as you can see, the ref and alt do not match: in the maf file, in case of an indel, the nucleotides common to ref and alt are omitted. This is why I am not able to find a match. — Anna, Sep 05 '21 at 17:23
I'm sorry, but I don't have the required domain knowledge to make sense of that. I may have been able to suggest a method to convert the snippet from the question to the desired result, but now in the _better example_ the _statistics_ has eight columns instead of five, and there's no desired result given, so I don't know what to do. — Armali, Sep 05 '21 at 17:34
@Armali. Thank you for your kind reply! In the better example there are 8 columns, so the issue is to somehow take only the non-matching nucleotides from columns 4 and 5 and match those to the ref column in the maf file — Anna, Sep 05 '21 at 17:56
If there is a match then I should add column 6 from the vcf stats file to my maf file — Anna, Sep 05 '21 at 18:04
If I apply the changes from the original example (as I understand them) to the better example, there would also be two additional columns _REF2 ALT2_ with the values AAAAAAA CAAAA, AAAA CAAAAAAA, AAAAAAAA CAAA, AAAAA CAAAAAA, AAAAAA CAAAAA and AAAAAAAAAAA C; is this what you'd like to have? — Armali, Sep 05 '21 at 18:37
@Armali Yes, exactly! That would be the case for deletion. For insertion, my ALT2 column should contain non-matching nucleotides. I guess I should count first which column (4 or 5) has more nucleotides in my vcf stats file? — Anna, Sep 05 '21 at 19:01

score 0 · Accepted Answer · answered Sep 05 '21 at 20:18

I think I should first count numbers of nucleotides in column4 and 5…

With awk, you can use the length function to count numbers of nucleotides:

awk 'NR==1 {print $0" REF2 ALT2"}       # assuming first line has column headers
     NR>1  {if (length($4)<length($5)) print $0, $4, gensub($4, "", 1, $5)
            else                       print $0, gensub($5, "", 1, $4), $5}' file
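
A note on portability: `gensub` is a GNU awk extension, and it treats its first argument as a regular expression, deleting the first match wherever it occurs in the string. A minimal sketch of a variant built on `substr`, which should run in any POSIX awk and removes the shorter allele only when it is a true prefix of the longer one, could look like this. The function name `add_ref2_alt2` is hypothetical, and leaving REF/ALT unchanged when they have equal length or share no prefix is an assumption, not something stated in the question.

```shell
# Hypothetical helper: reads whitespace-separated rows (CHR POS ID REF ALT ...)
# on stdin and appends two columns, REF2 and ALT2.
add_ref2_alt2() {
    awk '
    # Return s without the leading p if p is a prefix of s, else s unchanged.
    function strip_prefix(s, p) {
        return (substr(s, 1, length(p)) == p) ? substr(s, length(p) + 1) : s
    }
    NR == 1 { print $0, "REF2", "ALT2"; next }   # assumes a header line
    {
        ref2 = $4; alt2 = $5                      # default: keep both as they are
        if (length($4) < length($5))      alt2 = strip_prefix($5, $4)  # insertion
        else if (length($4) > length($5)) ref2 = strip_prefix($4, $5)  # deletion
        print $0, ref2, alt2
    }'
}
```

It would be called as `add_ref2_alt2 < file`. For the two sample rows in the question it gives the same REF2 and ALT2 values as requested there.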

answered Sep 05 '21 at 20:18

Armali

Thank you very much! Currently trying to understand the code as something does not work properly. `awk 'NR==1 {print $0" REF2 ALT2"} NR>1 {if (length($4)<length($5)) …` – Anna Sep 06 '21 at 08:18
If the `else` isn't at the beginning of a line, a `;` must be put before it. – Armali Sep 06 '21 at 08:45
Just curious, could you please explain this part gensub($4, "", 1, $5) – Anna Sep 06 '21 at 08:49
[`gensub`](https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html)`($4, "", 1, $5)` substitutes the `1`st match of `$4` (column 4) in `$5` (column 5) with `""` (nothing, i. e. deletes the match); column 5 is not changed, but the modified string is returned by the function. – Armali Sep 06 '21 at 09:08
Armali, may I ask you to help with another issue. – Anna Sep 28 '21 at 08:40
I have two files and first want to look at the column with the variant type in one of the files. If it is DEL, then look at the match between three columns in the two files (Chromosome, position, Reference) and append new columns in the first file with AC and AF coming from the second file. If it is IND, then look at the match between three columns in the two files (Chromosome, position, Alternative) and append new columns in the first file with AC and AF coming from the second file. – Anna Sep 28 '21 at 08:42
I wonder if awk will do the job easier, as I am failing to do that in R now – Anna Sep 28 '21 at 08:43
This comment thread is not a good place to discuss _another issue._ You could post a new question (better with a few sample lines from the _two files_ and sample desired output) and leave a link to the new question here; then I'd have a look. – Armali Sep 28 '21 at 12:48
https://stackoverflow.com/questions/69361764/match-between-the-columns?noredirect=1#comment122597067_69361764 – Anna Sep 28 '21 at 14:15
