
I have a sequencing data file containing base-pair locations from the genome that looks like the following example:

chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5

I would like to select groups of rows defined by the base-pair location in column 2, and then take the average of the numbers in column 5 for the matching regions.

So, using the example above, let's say I am looking for the average of the 5th column for all rows spanning chr1 810-820 and chr2 310-330. The first five rows should be identified, and their 5th-column numbers averaged, which equals 0.42.
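To make the goal concrete, here is a plain-Python sketch of the lookup described above (the `rows` list and `ranges` dict names are purely illustrative):

```python
# Illustrative sketch: filter rows whose (chromosome, position) falls
# inside one of the requested ranges, then average column 5.
rows = [
    ("chr1", 814, "G", "A", 0.5),
    ("chr1", 815, "T", "A", 0.3),
    ("chr1", 816, "C", "G", 0.2),
    ("chr2", 315, "A", "T", 0.3),
    ("chr2", 319, "T", "C", 0.8),
    ("chr2", 340, "G", "C", 0.3),
    ("chr4", 514, "A", "G", 0.5),
]
ranges = {"chr1": (810, 820), "chr2": (310, 330)}

# Keep column-5 values for rows whose chromosome has a range and whose
# position lies inside it (inclusive bounds).
hits = [r[4] for r in rows
        if r[0] in ranges and ranges[r[0]][0] <= r[1] <= ranges[r[0]][1]]
print(round(sum(hits) / len(hits), 2))  # 0.42
```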

I tried creating an array of ranges and then using awk to look up these locations, but have been unsuccessful. Thanks in advance.

The Nightman
  • Do you have to use awk for this? – vikramls Apr 06 '15 at 16:50
  • nope, that's just how I have tried doing this. – The Nightman Apr 06 '15 at 16:50
  • How will you provide the inputs? Will you have multiple entries from column1? – vikramls Apr 06 '15 at 16:52
  • If by multiple entries you mean multiple rows could match each range, then yes. There might be any number of matches for a given region. – The Nightman Apr 06 '15 at 16:53
  • Actually maybe you are asking if column 1 could change so that chr1 or chr2 could match the same numerical range from column 2. The answer to this is no. I would like a unique match per chromosome so something needs to match both chr1 in column 1 and a range of 800-850 in column 2 for example. – The Nightman Apr 06 '15 at 16:55

3 Answers

import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

s = """chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5"""

sio = StringIO(s)
df = pd.read_csv(sio, sep=" ", header=None)
df.columns = ["a", "b", "c", "d", "e"]

# The query expression is intuitive; pandas supports chained comparisons.
r = df.query("(a=='chr1' & 810<b<820) | (a=='chr2' & 310<b<330)")
print(r["e"].mean())

pandas may be a better fit for this kind of tabular data processing, and it's Python.
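One caveat worth noting: `810<b<820` excludes the endpoints, so a position at exactly 810 or 820 would not match. If the ranges are meant to be inclusive, the same query works with `<=` (a sketch, reusing the sample data; the result is unchanged here because no row sits on a boundary):

```python
import pandas as pd
from io import StringIO

s = """chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5"""

df = pd.read_csv(StringIO(s), sep=" ", header=None,
                 names=["a", "b", "c", "d", "e"])

# Inclusive bounds: positions 810 and 820 themselves would now match.
r = df.query("(a=='chr1' & 810<=b<=820) | (a=='chr2' & 310<=b<=330)")
print(round(r["e"].mean(), 2))  # 0.42
```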

Dyno Fu

Here's some Python code to do what you are asking for. It assumes that your data lives in a text file called 'data.txt'.

#!/usr/bin/env python

def avg(keys):
    """Average column 5 over rows whose (chrom, position) falls in a range."""
    key_sum = 0.0
    key_count = 0
    with open('data.txt') as fh:
        for item in fh:
            fields = item.split()
            krange = keys.get(fields[0])
            if krange:
                r = int(fields[1])
                if krange[0] <= r <= krange[1]:
                    key_sum += float(fields[-1])
                    key_count += 1
    print(key_sum / key_count)

keys = {}  # dict mapping each chromosome to its (low, high) range of interest
keys['chr1'] = (810, 820)
keys['chr2'] = (310, 330)

avg(keys)

Sample Output:

0.42
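If the ranges themselves live in a file (one `chrom low high` triple per line, like the `ranges` file in the awk answer below), the `keys` dict can be built the same way. A hypothetical helper (`load_ranges` is not part of the code above, just an illustration):

```python
def load_ranges(path):
    """Parse lines like 'chr1 810 820' into {'chr1': (810, 820)}."""
    keys = {}
    with open(path) as fh:
        for line in fh:
            chrom, low, high = line.split()
            keys[chrom] = (int(low), int(high))
    return keys
```

The returned dict can then be passed straight to `avg()`.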
vikramls

Here's an awk script answer. For input, I created a second file, which I called ranges:

chr1 810 820
chr2 310 330

The script itself looks like:

#!/usr/bin/awk -f

FNR==NR { low_r[$1] = $2; high_r[$1] = $3; next }

{ l = low_r[ $1 ]; h = high_r[$1]; if( l=="" ) next }

$2 >= l && $2 <= h { total+=$5; cnt++ }

END {
        if( cnt > 0 ) print (total/cnt)
        else print "no matched data"
}

The breakdown is:

  • FNR==NR - absorb the ranges file, building low_r and high_r arrays keyed off the first column in that file.
  • Then, for every data row, look up that row's chromosome in the low_r and high_r arrays. If there's no match, skip any other processing.
  • Check an inclusive range based on the low and high bounds, incrementing total and cnt for matched rows.
  • At the END, print the simple average when there were matches.

When the script (called script.awk) is made executable, it can be run like:

$ ./script.awk ranges data
0.42

where I've called the data file data.

n0741337