I've been working on this for a while and I'm getting pretty frustrated with it. Basically, I have two text documents that both contain important information. The first document contains the information I want to extract (chromatin name, start point, and end point), and I want to use it to look things up in the second document: I want to count the a, t, g, and c characters in each chunk defined by a start and end point. So I am trying to extract the start and end numbers, and then use them to chunk the second text and count the frequencies of a, t, c, and g in each of the chunks. I feel like I am getting close, but my biggest problem is this: how can I take the start and end points I extracted from the first text and use them as the start and end points of the chunks in the second?
Here is what I have so far:
from __future__ import division
import re
f = open('first_text.txt')  # this text has chromatin name, start/end points
raw = f.read()
raw = raw.lower()
lines = raw.splitlines()  # these next few lines are just for formatting
lines = [re.sub(r'\t', '', line) for line in lines]  # and getting rid of stuff I don't want
datas = []
for elem in lines:
    datas.append(elem.strip().split(' '))
wanted_stuff = []
for row in datas:  # separate loop variable, so the datas list is not overwritten
    wanted_stuff.append(row[0:3])  # extracting chromatin name, start, end
# and making a list of [name, start, end]'s.
# for example: ['chr1', '10000', '106000'] is on one line, etc.
# next line is another ['chrx', 'start number', 'end number'], and so on
chroms = []
starts = []
ends = []
for entry in wanted_stuff:  # separate loop variable, so the wanted_stuff list is not overwritten
    chroms.append(entry[0])
    starts.append(entry[1])
    ends.append(entry[2])
start_stop = [slice(int(start), int(stop)) for name, start, stop in wanted_stuff]
print start_stop  # raised "ValueError: too many values to unpack" while the loop above reused the name wanted_stuff
f.close()
f = open('dna.txt')
fdna = f.read()
fdna = fdna.lower()
format1 = re.sub(r'chr', '', fdna)  # getting rid of stuff I don't want
my_format = re.sub(r'[^atcgn]', '', format1)  # keep 'n' too, otherwise n_bits below is always 0
# SOME KIND OF CHUNKING MAGIC HERE?!?!?!
total = len(my_format)
n_bits = my_format.count('n')
a_bits = my_format.count('a')
t_bits = my_format.count('t')
g_bits = my_format.count('g')
c_bits = my_format.count('c')
def percentage(count, total):
    return 100 * count / total
f.close()
Right now this just prints a long list of numbers, counting how many a's there are in every chunk of 600 characters. However, I want to figure out how to define these chunks using the results from first_text instead. For example, for the result "chrom1, 10000, 10600", in the second part of my code I want 10000 to be the start and 10600 to be the end, and then I want to loop through all of the starts and ends and count the a's in every chunk. If I could return a result like "Chrom1, chunk 10000 - 10600 has 175 a's", I would be so happy!
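To make the goal concrete, a minimal sketch of that loop might look like the following. It assumes the regions are already available as [name, start, end] lists of strings (as in wanted_stuff), that the cleaned DNA is one long string, and that position 0 of that string really corresponds to position 0 of the chromosome (stripping characters beforehand could shift this). The function name count_a_per_chunk and the toy data are made up for illustration.

```python
def count_a_per_chunk(regions, sequence):
    """Return one report line per [name, start, end] region."""
    reports = []
    for name, start, end in regions:
        # start and end come out of the text file as strings, so convert first
        chunk = sequence[int(start):int(end)]
        reports.append("%s, chunk %s - %s has %d a's"
                       % (name, start, end, chunk.count('a')))
    return reports


# toy stand-ins for wanted_stuff and my_format (hypothetical data)
regions = [['chrom1', '0', '4'], ['chrom1', '4', '10']]
sequence = 'aatgcaaacg'

for line in count_a_per_chunk(regions, sequence):
    print(line)
```

Note that a Python slice excludes its end index, so sequence[0:4] covers positions 0 to 3 and the next chunk can start at 4 without counting anything twice.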
Can anyone help me out? I'm not a very good programmer... I know some of my code is redundant. Anyway, any input is much appreciated!!
EDIT to clear up some things: the extraction of the start and end points is working. If I
print wanted_stuff
my results are
"['Chrom1', '10000', '10600'], ['Chrom1', '10600', '12300'], ['Chrom1', '12300', '17000'], ['Chrom1', '17000', '21000'], ...."
and many more. The first number in each one is the start point (e.g. 10000). The second number is the end point (e.g. 10600).
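One possible way to turn entries like these into something that can index a string is to build slice objects from them. The sketch below assumes every entry has exactly three fields and that the start and end fields are decimal strings; to_slices is a hypothetical helper name, not something from my existing code.

```python
def to_slices(entries):
    """Map [name, start, end] string triples to (name, slice) pairs."""
    pairs = []
    for name, start, end in entries:
        pairs.append((name, slice(int(start), int(end))))
    return pairs


entries = [['Chrom1', '10000', '10600'], ['Chrom1', '10600', '12300']]
pairs = to_slices(entries)

# each slice can later be applied to a string as some_string[region]
for name, region in pairs:
    print("%s %d %d" % (name, region.start, region.stop))
```

Keeping the name next to its slice means the report for each chunk can still say which entry it came from.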
Edit - the start and end points should be the start and end points of the chunks. So I want to use 10000 and 10600 to find my_format[10000:10600] and count the a's in this chunk, and then do this for all of the starts and ends I get.
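The same slicing idea could be extended from a's to all four bases. This is only a sketch under the assumption that a chunk is a plain lowercase string; base_percentages is a made-up name, and the toy string stands in for the cleaned DNA text.

```python
def base_percentages(chunk):
    """Percentage of a, t, c and g in one chunk (an empty chunk gives all zeros)."""
    total = len(chunk)
    result = {}
    for base in 'atcg':
        # 100.0 forces float division on both Python 2 and Python 3
        result[base] = 100.0 * chunk.count(base) / total if total else 0.0
    return result


sequence = 'aatgcaaacg'  # hypothetical stand-in for the cleaned DNA string
chunk = sequence[0:4]    # the slice a [start, end] pair of 0 and 4 would select
stats = base_percentages(chunk)
print(stats['a'])
```

Guarding against an empty chunk avoids a ZeroDivisionError when a start and end point fall outside the text or are equal.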