How can I plot p-values for SNPs that are spread across thousands of scaffolds on a single continuous axis?

Question

I have association mapping derived P-values for SNPs that are scattered across thousands of scaffolds in a non-model organism. I would like plot the P-value of each SNP on a Manhattan-style plot. I do not care about the order of the scaffolds, but I would like to retain the relative order and spacing of SNP positions on their respective scaffolds. I simply want to visualize roughly how many genomic regions are significantly associated with a phenotype. For example:

My data looks something like this:

SCAFFOLD    POSITION
1           8967    
1           8986    
1           9002    
1           9025    
1           9064    
2           60995   
2           61091   
2           61642   
2           61898   
2           61921   
2           62034   
2           62133   
2           62202   
2           62219   
2           62220   
3           731894  
3           731907  
3           731962  
3           731999  
3           732000  
3           732050  
3           732076  
3           732097

I would like to write a perl code to create a third column that retains the distance between SNPs on the same scaffold, while arbitrarily spacing scaffolds by some number (100 in the following example):

SCAFFOLD    POSITION    CONTINUOUS_AXIS
1           8967        8967
1           8986        8986
1           9002        9002
1           9025        9025
1           9064        9064
2           60995       9164
2           61091       9260
2           61642       9811
2           61898       10067
2           61921       10090
2           62034       10203
2           62133       10302
2           62202       10371
2           62219       10388
2           62220       10389
3           731894      10489
3           731907      10502
3           731962      10557
3           731999      10594
3           732000      10595
3           732050      10645
3           732076      10671
3           732097      10692

Thank you to anyone who might have a good strategy.

score 0 · Accepted Answer · answered Apr 22 '16 at 04:08

Something like the following should work:

#!/usr/bin/env perl

use strict;
use warnings;

use constant SCAFFOLD_SPACING => 100;

my ($last_scaffold, $last_position, $continuous_axis, $found_data);

my $input = './input';
open my $fh, "<$input"
    or die "Unable to open '$input' for reading : $!";

print join( "\t", qw( SCAFFOLD POSITION CONTINUOUS_AXIS ) ) . "\n"; # Output Header
while (<$fh>) {
    next unless m|\d|; # Skip non-data lines

    my ($scaffold, $position) = split /\s+/; # Split on whitespace

    unless ($found_data++) {
        # Initialize
        $last_scaffold   = $scaffold; # Set to first data value
        $last_position   = $position; # Set to first data value
        $continuous_axis = $position; # Start continuous axis at first position
    }

    my $position_diff = $position - $last_position;
    my $scaffold_diff = $scaffold - $last_scaffold;

    if ($scaffold_diff == 0) {
        $continuous_axis += $position_diff;
    } else {
        $continuous_axis += SCAFFOLD_SPACING;
    }
    print join( "\t", $scaffold, $position, $continuous_axis ) . "\n";

    # Update
    $last_scaffold = $scaffold;
    $last_position = $position;
}

This did exactly what I was hoping for. Thank you! – Joe_McGirr Apr 24 '16 at 22:08 — Joe_McGirr, Apr 24 '16 at 22:08

How can I plot p-values for SNPs that are spread across thousands of scaffolds on a single continuous axis?

1 Answers1