
I have a sequencing data file containing base-pair locations from the genome that looks like the following example:

chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5

I would like to select groups of rows defined by the base-pair location in column 2, and then take the average of the numbers in column 5 for the matching regions.

So, using the example above, let's say I am looking for the average of the 5th column for all rows spanning chr1 810-820 and chr2 310-330. The first five rows should be identified, and their 5th-column numbers averaged, which equals 0.42.
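To make the goal concrete, here is a plain-Python sketch of the lookup described above (the `rows` list and `ranges` dict names are purely illustrative):

```python
# Illustrative sketch: filter rows whose (chromosome, position) falls
# inside one of the requested ranges, then average column 5.
rows = [
    ("chr1", 814, "G", "A", 0.5),
    ("chr1", 815, "T", "A", 0.3),
    ("chr1", 816, "C", "G", 0.2),
    ("chr2", 315, "A", "T", 0.3),
    ("chr2", 319, "T", "C", 0.8),
    ("chr2", 340, "G", "C", 0.3),
    ("chr4", 514, "A", "G", 0.5),
]
ranges = {"chr1": (810, 820), "chr2": (310, 330)}

# Keep column-5 values for rows whose chromosome has a range and whose
# position lies inside it (inclusive bounds).
hits = [r[4] for r in rows
        if r[0] in ranges and ranges[r[0]][0] <= r[1] <= ranges[r[0]][1]]
print(round(sum(hits) / len(hits), 2))  # 0.42
```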

I tried creating an array of ranges and then using awk to look up these locations, but have been unsuccessful. Thanks in advance.

The Nightman
  • Do you have to use awk for this? – vikramls Apr 06 '15 at 16:50
  • nope, that's just how I have tried doing this. – The Nightman Apr 06 '15 at 16:50
  • How will you provide the inputs? Will you have multiple entries from column1? – vikramls Apr 06 '15 at 16:52
  • If by multiple entries you mean multiple rows could match each range, then yes. There might be any number of matches for a given region. – The Nightman Apr 06 '15 at 16:53
  • Actually maybe you are asking if column 1 could change so that chr1 or chr2 could match the same numerical range from column 2. The answer to this is no. I would like a unique match per chromosome so something needs to match both chr1 in column 1 and a range of 800-850 in column 2 for example. – The Nightman Apr 06 '15 at 16:55

3 Answers

import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

s = """chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5"""

sio = StringIO(s)
df = pd.read_csv(sio, sep=" ", header=None)
df.columns = ["a", "b", "c", "d", "e"]

# The query expression is intuitive; pandas supports chained comparisons.
r = df.query("(a=='chr1' & 810<b<820) | (a=='chr2' & 310<b<330)")
print(r["e"].mean())

pandas may be a better fit for this kind of tabular data processing, and it's Python.
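One caveat worth noting: `810<b<820` excludes the endpoints, so a position at exactly 810 or 820 would not match. If the ranges are meant to be inclusive, the same query works with `<=` (a sketch, reusing the sample data; the result is unchanged here because no row sits on a boundary):

```python
import pandas as pd
from io import StringIO

s = """chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5"""

df = pd.read_csv(StringIO(s), sep=" ", header=None,
                 names=["a", "b", "c", "d", "e"])

# Inclusive bounds: positions 810 and 820 themselves would now match.
r = df.query("(a=='chr1' & 810<=b<=820) | (a=='chr2' & 310<=b<=330)")
print(round(r["e"].mean(), 2))  # 0.42
```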

Dyno Fu

Here's some Python code to do what you are asking for. It assumes that your data lives in a text file called 'data.txt'.

#!/usr/bin/env python

def avg(keys):
    """Average column 5 over rows whose (chrom, position) falls in a range."""
    key_sum = 0.0
    key_count = 0
    with open('data.txt') as fh:
        for item in fh:
            fields = item.split()
            krange = keys.get(fields[0])
            if krange:
                r = int(fields[1])
                if krange[0] <= r <= krange[1]:
                    key_sum += float(fields[-1])
                    key_count += 1
    print(key_sum / key_count)

keys = {}  # dict mapping each chromosome to its (low, high) range of interest
keys['chr1'] = (810, 820)
keys['chr2'] = (310, 330)

avg(keys)

Sample Output:

0.42
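If the ranges themselves live in a file (one `chrom low high` triple per line, like the `ranges` file in the awk answer below), the `keys` dict can be built the same way. A hypothetical helper (`load_ranges` is not part of the code above, just an illustration):

```python
def load_ranges(path):
    """Parse lines like 'chr1 810 820' into {'chr1': (810, 820)}."""
    keys = {}
    with open(path) as fh:
        for line in fh:
            chrom, low, high = line.split()
            keys[chrom] = (int(low), int(high))
    return keys
```

The returned dict can then be passed straight to `avg()`.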
vikramls

Here's an awk script answer. For input, I created a second file, which I called ranges:

chr1 810 820
chr2 310 330

The script itself looks like:

#!/usr/bin/awk -f

FNR==NR { low_r[$1] = $2; high_r[$1] = $3; next }

{ l = low_r[ $1 ]; h = high_r[$1]; if( l=="" ) next }

$2 >= l && $2 <= h { total+=$5; cnt++ }

END {
        if( cnt > 0 ) print (total/cnt)
        else print "no matched data"
}

The breakdown is:

  • FNR==NR - absorb the ranges file, building low_r and high_r arrays keyed off the first column in that file.
  • Then, for every data row, look up that row's chromosome in the low_r and high_r arrays. If there's no match, skip any other processing.
  • Check an inclusive range based on the low and high bounds, incrementing total and cnt for matched rows.
  • At the END, print the simple average when there were matches.

When the script (called script.awk) is made executable, it can be run like:

$ ./script.awk ranges data
0.42

where I've called the data file data.

n0741337