1

I have two files. First, I want to look at the column "Variant_Type" in the first file. If it is DEL, then I should look if there is a match between three columns in two files (Chromosome, vcf_pos, Reference_Allele) and append AC and AF columns from the first file to the second file. If it is INS in a "Variant_Type", then I look for the match between another three columns in the two files (Chromosome, vcf_pos, Tumor_Seq_Allele2) and append relevant AC and AF columns from the second file. If it is SNP, then again look for the match between another three columns in the two files (Chromosome, vcf_pos, Tumor_Seq_Allele2) and append relevant AC and AF columns from the second file.

Here is the snippet from File 1

Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2 Variant_Type vcf_pos 
TMEM80      chr11      704605         704605       A                -                 DEL          704604 
OR52P1P     chr11      5726537        5726537      T                -                 DEL          5726536
UBTFL1      chr11      90086720       90086721     -                T                 INS          90086720
DCPS        chr11      126306583      126306584    -                TGGGGA            INS          126306583
DCPS        chr11      126306583      126306584    -                TGGGGAAA          INS          126306583

File 2

Chromosome vcf_pos      AF       AC      Reference_Allele  Tumor_Seq_Allele2
chr11      704604       0.2      10      A                 - 
chr11      5726536      0.35     13      T                 -
chr11      90086720     0.25     16      -                 T
chr11      126306583    0.5      29      -                 TGGGGA 
chr11      126306583    0.3      39      -                 TGGGGAAA

Desired output

Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2 Variant_Type vcf_pos   AF   AC 
TMEM80      chr11      704605         704605       A                -                 DEL          704604    0.2  10
OR52P1P     chr11      5726537        5726537      T                -                 DEL          5726536   0.35 13
UBTFL1      chr11      90086720       90086721     -                T                 INS          90086720  0.25 16
DCPS        chr11      126306583      126306584    -                TGGGGA            INS          126306583 0.5  29
DCPS        chr11      126306583      126306584    -                TGGGGAAA          INS          126306583 0.3  39

As a possible solution I was thinking about merge function in R, but might work better with awk

MyICQ
  • 987
  • 1
  • 9
  • 25
Anna
  • 53
  • 6
  • Please next time properly format your files in columns using tags. It helps to help others help you. [See here how](https://stackoverflow.com/help/formatting) – MyICQ Sep 28 '21 at 13:30
  • Kindly do add your efforts to avoid downvotes and close-votes to your question. Adding efforts in question is highly encouraged on SO, thank you. – RavinderSingh13 Sep 28 '21 at 14:20
  • [Not generally.](https://meta.stackoverflow.com/questions/411791/making-canonical-answers-easier-to-find?cb=1#comment869985_411791) – Armali Sep 28 '21 at 14:37
  • What does this link mean? Should I change smth in my question as it is now? – Anna Sep 28 '21 at 14:44
  • 1
    @MyICQ - While your formatting looks better, it changes the file structure where the columns are separated by a single space, so one might even say you distorted the post. – Armali Sep 28 '21 at 14:45
  • @Anahit Hovhannisyan - Don't worry, _this link_ was not directed to you. – Armali Sep 28 '21 at 14:46
  • Would it be acceptable if the output rows are ordered differently? (I may be helpful to have them sorted.) – Armali Sep 28 '21 at 18:44
  • 1
    @Armali I get your point, no intention to do this. Will be more careful editing space-separated files in future. – MyICQ Sep 29 '21 at 07:56
  • @Armali, Yes order does not matter much. – Anna Sep 29 '21 at 19:17

2 Answers2

0

As a possible solution I was thinking about merge function in R

Indeed the R merge function together with plain indexing can do it.

t1 = read.table('File 1', T)
t2 = read.table('File 2', T)
AFAC = c('AF', 'AC')        # columns to copy
l = t1$Variant_Type=='DEL'  # rows to process
t1[l, AFAC] = merge(t1[l, c('Chromosome', 'vcf_pos', 'Reference_Allele')], t2, sort=F)[AFAC]
l = t1$Variant_Type %in% c('INS', 'SNP')
t1[l, AFAC] = merge(t1[l, c('Chromosome', 'vcf_pos', 'Tumor_Seq_Allele2')], t2, sort=F)[AFAC]
write.table(t1, 'output', F, F)

Since the match columns for INS and SNP are the same, the processing of these two Variant_Types can be combined.

Armali
  • 18,255
  • 14
  • 57
  • 171
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/237901/discussion-on-answer-by-armali-match-between-the-columns). – Machavity Oct 06 '21 at 16:22
0

might work better with awk

Just for comparison - a solution with awk:

awk 'BEGIN { while (getline <"File 2" > 0)      # make "dictionary" of AF, AC for ...
             { AFAC1[$1","$2","$5] = $3" "$4    # ... Chr., vcf. and Reference_Allele
               AFAC2[$1","$2","$6] = $3" "$4    # ... Chr., vcf. and Tumor_Seq_Allele2
             }
           }
     NR==1     { print $0" AF AC" }             # first line has column headers
     $7=="DEL" { print $0, AFAC1[$2","$8","$5] }# append the 1st stored AF, AC
     $7=="INS"||
     $7=="SNP" { print $0, AFAC2[$2","$8","$6] }# append the 2nd stored AF, AC
    ' "File 1"
Armali
  • 18,255
  • 14
  • 57
  • 171
  • 1
    Thank you very much for your kind help and all the commands explanation! – Anna Oct 01 '21 at 15:57
  • Sorry for disturbing with my questions, also could you explain this part "while (getline <"File 2" > 0)" – Anna Oct 04 '21 at 11:42
  • This loops over all lines in "File 2", reading one line after the other, and executes the following block for each line in order to build the in-memory lookup tables used later when reading "File 1". – Armali Oct 04 '21 at 12:10
  • I am trying now with awk, but having some error with syntax – Anna Oct 04 '21 at 18:38
  • The command (the field positions is a bit different) awk 'BEGIN { while (getline < af_ac_no_duplicates > 0) {AFAC1[$1","$2","$8"] = $6" "$7 AFAC2[$1","$2","$9] = $6" "$7 } } NR==1 { print $0" AF AC" } $8=="DEL" { print $0, AFAC1[$2","$21","$5"] } $8=="INS"|| $8=="SNP" { print $0, AFAC2[$2","$21","$6] } ' "maf_unique.maf" – Anna Oct 04 '21 at 18:38
  • awk: cmd. line:1: BEGIN { while (getline < af_ac_no_duplicates > 0) {AFAC1[$1","$2","$8"] = $6" "$7 AFAC2[$1","$2","$9] = $6" "$7 } } NR==1 { print $0" AF AC" } $8=="DEL" { print $0, AFAC1[$2","$21","$5"] } $8=="INS"|| $8=="SNP" { print $0, AFAC2[$2","$21","$6] } awk: cmd. line:1: ^ syntax error – Anna Oct 04 '21 at 18:39
  • awk: cmd. line:1: BEGIN { while (getline < af_ac_no_duplicates > 0) {AFAC1[$1","$2","$8"] = $6" "$7 AFAC2[$1","$2","$9] = $6" "$7 } } NR==1 { print $0" AF AC" } $8=="DEL" { print $0, AFAC1[$2","$21","$5"] } $8=="INS"|| $8=="SNP" { print $0, AFAC2[$2","$21","$6] } awk: cmd. line:1: ^ unexpected newline or end of string – Anna Oct 04 '21 at 18:39
  • There is an extra, wrong `"` after `AFAC1[$1","$2","$8` and `AFAC1[$2","$21","$5`. – Armali Oct 04 '21 at 19:01
  • Fixed that, thank you! But still does not work. awk: cmd. line:1: BEGIN { while (getline < af_ac_no_duplicates > 0) {AFAC1[$1","$2","$8] = $6" "$7 AFAC2[$1","$2","$9] = $6" "$7 } } NR==1 { print $0" AF AC" } $8=="DEL" { print $0, AFAC1[$2","$21","$5] } $8=="INS"|| $8=="SNP" { print $0, AFAC2[$2","$21","$6] } awk: cmd. line:1: ^ syntax error – Anna Oct 04 '21 at 19:08
  • If you put everything in one line (can't judge this from the comment formatting), a `;` must be put after `AFAC1[$1","$2","$8] = $6" "$7`. – Armali Oct 04 '21 at 19:10
  • ` awk 'BEGIN { while (getline < af_ac_no_duplicates > 0) {AFAC1[$1","$2","$8] = $6" "$7 ; AFAC2[$1","$2","$9] = $6" "$7 } } NR==1 { print $0" AF AC" } $8=="DEL" { print $0, AFAC1[$2","$21","$5] } $8=="INS"|| $8=="SNP" { print $0, AFAC2[$2","$21","$6] } ' "maf_unique.maf" ` awk: cmd. line:1: fatal: expression for `<' redirection has null string value – Anna Oct 04 '21 at 19:22
  • Tried to put in one line, but could not, sorry – Anna Oct 04 '21 at 19:22
  • Ah, I forgot to mention that `af_ac_no_duplicates` must be put in `"` `"` (otherwise it's interpreted as a variable name). By the way, duplicates wouldn't disturb the `awk` script, it would just take one of the value pairs. – Armali Oct 04 '21 at 19:27
  • 1
    Ok, this works now!! I am checking now -so for SNPs it worked perfectly filling in that fields, but for DEL and INS I do not have an output...I am confused why it does not work for them – Anna Oct 04 '21 at 19:45
  • By _I do not have an output_ do you mean you haven't even the original lines from "maf_unique.maf", or just that there are no AF and AC values appended? Unfortunately since I don't have your data, I can't test this and only say that it worked with the small example data from the question post. – Armali Oct 04 '21 at 19:54
  • 1
    I have the original file, but without values appended – Anna Oct 04 '21 at 20:13
  • 1
    I am more than happy to share snippet of the two files, if you have a time for that – Anna Oct 04 '21 at 20:14
  • 1
    You already helped me so much! – Anna Oct 04 '21 at 20:14
  • 1
    Or may be I can post here 10 lines as an examples with all types of varients ( SNP. DEL , INS) and there correspondence values in AF and AC? . Would that work? Thank you – Anna Oct 04 '21 at 20:16
  • That might help, yes. You could add the data to the question post for better formatting. – Armali Oct 04 '21 at 20:27
  • 1
    Already did that. Thank you! – Anna Oct 04 '21 at 20:39
  • In your `awk` command, field $21 (column `vcf_pos`) is used, but in the _real data_ you posted all but one value rows in File 1 have only 20 columns with no `vcf_pos`. - This is my last message for today - maybe till tomorrow! – Armali Oct 04 '21 at 21:38
  • 1
    Ah ok, I think I know the reason why in awk the command is not successful. When there are some of the not-filled values in the columns it makes less columns in a row. While this is not a problem when I open my file in R – Anna Oct 05 '21 at 06:32
  • 1
    It is perfectly working with awk -F'\t' in the beginning! Thank you – Anna Oct 05 '21 at 07:57
  • Ah, sorry, I didn't copy the data in a way that would have preserved the TABs. I'm glad to hear that it works, and you're welcome. – Armali Oct 05 '21 at 08:37
  • 1
    Thanks a lot! May I only ask one last thing. I just want to check which position not match in two files. Why does this command not to work? awk -F'\t' 'FNR == NR { mem[ $2 $21] = 1 ;print ; next } { key = $1 $2 ;if ( ! ( key in mem) ) print } ' maf_unique.maf af_ac_no_duplicates >out – Anna Oct 05 '21 at 11:16
  • (I take the `print` of every line from the first file is only for debugging.) This command does work for lines with TAB separated fields. In the sample data you posted lastly, file 2 has some lines with no TABs, but multiple spaces between fields, namely the header line and the three last lines. Is this also the case with the data you use? – Armali Oct 05 '21 at 15:10
  • Already fixed that with sed 's/ /\t/g' – Anna Oct 05 '21 at 16:40
  • But this didn't solve the problem, did it? I'd think that it won't work to replace every single space with a TAB, at least not with the sample data, where there are multiple spaces between two columns to be replaced by a single TAB. – Armali Oct 05 '21 at 17:55
  • 1
    It works well and I have tab-separated file – Anna Oct 06 '21 at 03:48
  • It's good to hear that, even if I don't understand why :-) - except if in the data you used there were only single spaces between columns, then it's understandable that replacing each with TAB fixed it. – Armali Oct 06 '21 at 07:16