5

i have two files one with positions information and another is sequence information. Now i need to read the positions and take the snps at the positions and replace that position base with the snp information in the sequence and write it in the snp information file.. for example

Snp file contains

10 A C A/C

Sequence file contains

ATCGAACTCTACATTAC

Here 10th element is T so i will replace T with [A/C] so the final output should be

10 A C A/C ATCGAACTC[A/C]ACATTAC

Example files are

Snp file

SNP Ref Alt
10  A   C
19  G   C
30  C   T
42  A   G

Sequence :

sequence1 CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

Final output :

SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT

While replacing the snps here from Ref and Alt column, we need to consider the order of {A,T,C,G} like the [Ref/Alt] always the first base should be either A or T or C and followed by that order.

Another thing is if we take the snp position, and if there are any snps in 10 bases difference, we need to replace that snp position with "N". In the above example in first two positions as the difference is 9 we are replacing the other element with 'N'.

I have written code where it prints the positions in order and also replaces the sequence with that snp position but unable to read nearby positions and replace with N.

My approach might be completely wrong as i am beginner in coding.. I think by using hashes, we might achieve this easily but i am not so familiar with hashes..help with some suggestions please.. i dont have to stick with only perl,

my $input_file = $ARGV[0];
my $snp_file = $ARGV[1];
my $output_file = $ARGV[2];

%sequence_hash = ();

open SNP, $snp_file || die $!;
$indel_count = 0;
$snp_count = 0;
$total_count = 0;

#### hashes and array
@id_array = ();

while ($line = <SNP>) {

    next if $line =~ /^No/;
    $line =~ s/\r|\n//g;


   # if ($line =~ /\tINDEL/) {

    #    $indel_count++;
     #   $snp_type = "INDEL";

    #} else {
     #   $snp_count++;
      #  $snp_type = "SNP";
    #}

    @parts = split(/\t/,$line);

    $id = $parts[0];
    $pos = $parts[1];
    #$ref_base = $parts[3];
    @temp_ref = split(",",$parts[2]);
    $ref_base = $temp_ref[0];
    @alt = split(",",$parts[3]);
    $alt_base = $alt[0];
    $snpformat = '';

    if ($ref_base eq "A" || $alt_base eq "A")
    {

        if ($ref_base eq "A"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "T" || $alt_base eq "T")
    {

        if ($ref_base eq "T"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "C" || $alt_base eq "C")
    {

        if ($ref_base eq "C"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }


    else 
    {$snpformat = "-/-" ;}
    print " $id \t $pos \t $ref_base \t $alt_base \t $snpformat \n  ";
}

open SEQ, $input_file ||die $!;

$header = '';
$sequence = '';
$num_sequences = 0;

while ($line = <SEQ>) {

    $line =~ s/\r|\n//g;
    next if $line =~ //;

    if ($line =~ /^>(.+)/) {
        if ($header eq '') {

            $header = $1;
            $sequence = '';
            next;
        } else {

            $sequence_len = length($sequence);

            $sequence_hash{$header} = $sequence;
            push (@headers,$header);
            #print $header."\t".$sequence_len."\n";
            #print $sequence."\n";
            $num_sequences++;

            $header = $1;
            $sequence = '';

        }


    } else {

        $sequence .= $line;

    }

}
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";

$num_sequences++;

close (SEQ);

$pos = '4';
substr($sequence,$pos,1) = "[A/G]";
print $sequence."\n";   
print "$pos \n";
user630605
  • 167
  • 1
  • 4
  • 10
  • Please add `use strict` and `use warnings` to your program to help you track down mistakes. Also, please go through your old questions and accept answers that you found helpful. See the [faq#hotoask] for more info on that. – simbabque Oct 29 '12 at 15:31
  • I did add the strict and warnings but while copying it got cut... i only post it for suggestions but not to get complete solution.. – user630605 Oct 29 '12 at 15:45
  • 3
    Ok. While I like your detailed explanation of the problem, it's always a good idea to add these things also in the snippets you post, so people know you are using them. There are undeclared variables in your code. We don't know if you declared them in your program in a part that you did not show. It's best if you post a [short, self-contained, correct example](http://sscce.org/). That way, it is easier to understand your problem and to help you. – simbabque Oct 29 '12 at 15:49
  • I think it's safe to say this is too big a job for sed. (Not that you *can't* do it with sed, you just *shouldn't*.) – Beta Oct 30 '12 at 06:35

3 Answers3

1

This awk script can probably help you get desired result.

awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR { 
    a[++i]=$0
    next 
} 
FNR>1 { 
    x=substr(a[i],1,($1-1))
    z=substr(a[i],($1+1))
    if ($2=="A") {
        y="["$2"/"$3"]"
    } 
    else if ($2=="T" && $3=="A") {
        y="["$3"/"$2"]"
    }
    else if ($2=="C" && ($2=="A" || $2=="T")) {
        y="["$3"/"$2"]"
    }
    else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
        y="["$3"/"$2"]"
    }
    else 
        y="["$3"/"$2"]"
    print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp

Test:

[jaypal:~/temp] cat sequence
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

[jaypal:~/temp] cat snp
SNP Ref Alt
10  A   C
19  G   C
30  C   T
42  A   G

[jaypal:~/temp] awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR { 
    a[++i]=$0
    next 
} 
FNR>1 { 
    x=substr(a[i],1,($1-1))
    z=substr(a[i],($1+1))
    if ($2=="A") {
        y="["$2"/"$3"]"
    } 
    else if ($2=="T" && $3=="A") {
        y="["$3"/"$2"]"
    }
    else if ($2=="C" && ($2=="A" || $2=="T")) {
        y="["$3"/"$2"]"
    }
    else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
        y="["$3"/"$2"]"
    }
    else 
        y="["$3"/"$2"]"
        print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp
SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
jaypal singh
  • 74,723
  • 23
  • 102
  • 147
0

I am not a perl expert, but I think this will do it:

#!/usr/bin/perl

open(SEQ, "seq");
my $seq = <SEQ>;
$seq =~ s/.* //;

print "SNP Ref Alt Output\n";
open(SNP, "snp");
<SNP>;# header line
while(<SNP>)
{
    my($line) = $_;
    chomp($line);
    my ($loc, $ref, $alt) = split(/ +/, $line);
    my $outString = $seq;
    substr($outString, $loc-1, 1, "[$ref/$alt]");
    print $loc."  ".$ref."   ".$alt."   ".$outString."\n";
}
Beta
  • 96,650
  • 16
  • 149
  • 150
0
A=1;T=2;C=3;G=4
echo "SNP Ref Alt Output"
while read l1 l2 l3; do
    lp=$(($l1 - 1))
    eval ol2=\$$l2 && eval ol3=\$$l3
    if [[ $ol2 > $ol3 ]]; then 
        ol2=$l3 && ol3=$l2; 
    else 
        ol2=$l2 && ol3=$l3; 
    fi
    sed "s@[^ ]* \(.\{$lp\}\).\(.*\)@$l1  $l2   $l3   \1[$ol2\/$ol3]\2@" sequence
done < snp 

Output

SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
perreal
  • 94,503
  • 21
  • 155
  • 181
  • OP has a constraint that substitution needs to follow certain order. `While replacing the snps here from Ref and Alt column, we need to consider the order of {A,T,C,G} like the [Ref/Alt] always the first base should be either A or T or C and followed by that order.` – jaypal singh May 28 '13 at 18:09