-1

I want to impute marker classes (either class A or class B), based on proximity of known marker classes. So for example if I know M1 and M4 are class A, then all markers positioned in the map between M1 and M4 can also be classified as A.

If I know marker M4 is class A and its position is chr1 13, and marker M7 is B with position 16, then we can classify all markers with position less than equal to (13+16)/2=14.5 as A and everything between 14.5 and 16 as B on the same chromosome. So M5 is A and M6 can be classified as B.

I have a map of sorted positions of markers

M0  chr1    9
M1  chr1    10
M2  chr1    11
M3  chr1    12
M4  chr1    13
M5  chr1    14
M6  chr1    15
M7  chr1    16
M8  chr2    1
M9  chr2    2
M10 chr2    3
M11 chr2    4

So given a simple backbone of

M1  A
M4  A
M7  B
M8  B
M10 A

I want to impute the rest of the markers on the map, if possible.

So my desired output is

M1  A
M2  A
M3  A
M4  A
M5  A
M6  B
M7  B
M8  B
M9  B
M10 A

I am a biologist trying to learn a little bit of awk, and relaize this maybe just a computational problem and I`m not sure where to get started. Please help. I have access to unix cluster to run awk and perl. Please note, correct imputation can be only done between markers mapped to the same chromosome.

  • 1
    You really ought to try something, especially if you are trying to learn awk. – Borodin Sep 29 '16 at 21:48
  • What if an unknown falls on the midpoint between a known A and a known B? For instance, in your example data, if M5 were known to be A and M7 were B, what would you assign to M6? Would it become unknown like M0 and M11? – Borodin Sep 29 '16 at 21:56
  • Since you have sorted data, half the work is already done. You need to process each sequence between successive known classes one by one, assigning values to the unknowns accordingly. So M1 through to M4 is trivial, then M4 to M7 slightly less so. You need to check whether an interval crosses a chromosome boundary like M7 to M8, in which case nothing can be imputed, and finally M8 to M10 is straightforward. – Borodin Sep 29 '16 at 22:04
  • sorry, there were some major power outages, due to storms and I couldnt get back to you in time. ..the markers right in between can be classified either way, in the example i mentioned less than equal to the mean as the first classification. Than – Alpesh Querer Sep 30 '16 at 14:19
  • Let me do some testing with me code and get back to you. I will also ask you about the code logic if I cant follow it myself, will be a great leanrning experience. Thanks again. – Alpesh Querer Sep 30 '16 at 14:20

1 Answers1

0

You never answered any of my questions, so here's a Perl solution that makes a lot of guesses

use strict;
use warnings 'all';
use autodie;

my (@markers, %markers);
{
    open my $fh, '<', 'markers.txt';

    while ( <$fh> ) {
        my @marker = split;
        push @markers, \@marker;
        $markers{$marker[0]} = $#markers;
    }
}

my ($i0, $i1);

open my $fh, '<', 'classes.txt';

while ( <$fh> ) {

    my ($marker, $class) = split;

    $i1 = $markers{$marker};
    my $m1 = $markers[$i1];
    push @$m1, $class;

    next unless defined $i0;

    my $m0 = $markers[$i0];

    next if $m0->[1] ne $m1->[1];          # Different chromosomes

    my $mid = ( $m0->[2] + $m1->[2] ) / 2; # Mid point between markers

    for my $m ( @markers[ $i0+1 .. $i1-1 ] ) {
        push @$m, $m->[2] <= $mid ? $m0->[3] : $m1->[3];
    }
}
continue {
    $i0 = $i1;
}

printf "%-4s%-8s%-4d%-s\n", @{$_}[0..2], $_->[3] // '' for @markers;

output

M0  chr1    9   
M1  chr1    10  A
M2  chr1    11  A
M3  chr1    12  A
M4  chr1    13  A
M5  chr1    14  A
M6  chr1    15  B
M7  chr1    16  B
M8  chr2    1   B
M9  chr2    2   B
M10 chr2    3   A
M11 chr2    4   
Borodin
  • 126,100
  • 9
  • 70
  • 144