0

I've been working on this for a while and I'm getting pretty frustrated with it. Basically, I have two text documents that both have important information. The first document contains some information I want to extract (chromatin name, start point, and end point), and I want to use this information to search for information in the second text (I want to count the atgc for each chunk defined by the start and end point). So I am trying to extract the start-end sequence numbers, and then use those to chunk and count the frequencies of atcg for each of the frequencies. I feel like I am getting close, but my biggest problem is how can I use the start and end points I extracted from the first text, and use them as start and end points in making chunks in the second?

Here is what I have so far:

from __future__ import division
import nltk, re, pprint, subprocess

f = open('first_text.txt') #this text has chromatin name, start/end points
raw = f.read()
raw = read.lower ()
l = raw.splitlines() #these next few lines are just for formatting
l = [re.sub(r'\t', '', l) for l in l] #and getting rid of stuff I don't want

datas = []
for elem in l:
    datas.append(elem.strip().split(' '))

wanted_stuff = []
for datas in datas:
    wanted_stuff.append(datas[0:3]) #extracting chromatin name, start, end
    # and making a list of [name, start, end]'s.
    # for example: ['chr1', '10000', '106000'] is on one line, etc. 
    # next line is another ['chrx', 'start number', 'end number'], and so on

chroms = []
starts = []
ends = []
for wanted_stuff in wanted_stuff:
    chroms.append(wanted_stuff[0])
    starts.append(wanted_stuff[1])
    ends.append(wanted_stuff[2])    

start_stop = [slice(int(starts), int(stops)) for chroms, starts, stops in wanted_stuff]

print start_stop # ValueError: too many values to unpack

f.close()

f = open('dna.txt')
fdna = f.read()
fdna = fdna.lower()
format1 = re.sub(r'chr, '', fdna) #getting rid of stuff I don't want
my_format = re.sub(r'[^atcg]', '', format1)

# SOME KIND OF CHUNKING MAGIC HERE?!?!?!

total = len(my_format)
n_bits =  my_format.count('n')
a_bits =  my_format.count('a')
t_bits =  my_format.count('t')
g_bits =  my_format.count('g')
c_bits =  my_format.count('c')

def percentage(count, total):
    return 100 * count / total
f.close() 

Right now this just prints a long list of numbers, counting how many a's there are in every chunk of 600 characters. However, I want to figure out how to define these chunks by what I have as the results of my first_text. (I.e. for the result "chrom1, 10000, 10600", in the second part of my code I want 10000 to the the start, 10600 to be the end, and then loop through all of the starts and ends, to count "a" in every trunk. If I could return a result like, "Chrom1, chunk 10000 - 10600 has 175 a's", I would be so happy!

Can anyone help me out? I'm not a very good programmer... I know some of my code is redundant. Anyway, any input is much appreciated!!

EDIT to clear up some things: The extraction of the start and end points is working. If I

print wanted_data

My results are

"['Chrom1', '10000', '10600'], ['Chrom1', '10600', '12300'], ['Chrom1', '12300', '17000'], ['Chrom1', '17000', '21000]', ...."

many more. The first number in each one is the start point (e.g. 10000). The second point is the end point in each set (e.g. 10600)

Edit - the start and end points should be the start and end points of the chunks. So I want to use 10000 and 106000 to find format2[10000:106000] and count the a's in this chunk, and then do this for all of the starts and ends I get.

SnarkShark
  • 360
  • 1
  • 7
  • 20
  • The *extraction* of the start and end points is working? Please provide an brief example of the extracted data. – wwii Jul 27 '15 at 02:34
  • In the loop, ```for datas in datas:``` you overwrite ```wanted_stuff``` each iteration - it will only contain the last *item* when the iteration completes. – wwii Jul 27 '15 at 02:38
  • What are the "starts and ends"? Character offsets into the second file? Line offsets? Something else? Can you point us towards an example of the data you're working with? – larsks Jul 27 '15 at 02:41
  • I edited and clarified things at the bottom. Meanwhile, wwii, if I need to keep all of them, should I append the results onto an empty list perhaps? – SnarkShark Jul 27 '15 at 02:55
  • @wwii, thank you for noticing this. I appended it to an empty list and will edit my original post for this. – SnarkShark Jul 27 '15 at 03:12
  • This cannot be your actual code as you there are some syntax errors in it that wouldn't work and you'ld be asking us about - yhere is a missing right paren in ```l = [re.sub(r'\t', '', l] for l in l]``` – wwii Jul 27 '15 at 03:21
  • What is this `raw = read.lower ()` and why are you importing `nltk`? – Burhan Khalid Jul 27 '15 at 03:23
  • I don't have internet on my programming computer, so I had to copy it off the screen and type it by hand... sorry. I'll double check to make sure I didn't miss anything. – SnarkShark Jul 27 '15 at 03:24
  • Burhan Khalid,raw = read.lower() makes everything lowecase. I think I need nltk to do that? – SnarkShark Jul 27 '15 at 03:25
  • If the start is 5 and the end is 8, do you want ```5,6,7``` or ```5,6,7,8```?? – wwii Jul 27 '15 at 03:27
  • If have a chunk [5:8], I think in python it always gives just 5,6,7, right? Just normal like this is good for me, so I'm not worried about adding anything to the start and end points. Does that make sense? – SnarkShark Jul 27 '15 at 03:30

1 Answers1

0

A bit of cleanup; you can remove all the tabs from raw before you split it.

raw = read.lower ()
raw = re.sub(r'\t', '', raw)
l = raw.splitlines() #these next few lines are just for formatting

You want to save all of the extracted data instead - here is one way with a list.

wanted_stuff = []
for datas in datas:
    wanted_stuff.append(data[0:3]) #extracting chromatin name, start, end
# or as a list comprehension
#wanted_stuff = [data[:3] for data in datas]

If your data looks like this:

a = [['Chrom1', '10000', '10600'], ['Chrom1', '10600', '12300'],
     ['Chrom1', '12300', '17000'], ['Chrom1', '17000', '21000']]

You can create a list of slice objects:

start_stop = [slice(int(start), int(stop)) for c, start, stop in a]
#if you need to conserve resources or time, use a generator expression
#start_stop = (slice(int(start), int(stop)) for c, start, stop in a)

That used a list comprehension to create start_stop. Written as a traditional for loop it looks like this:

start_stop = []
for c, start, stop in a:
    start_stop.append(slice(start, stop))

This uses unpacking to separate the items:

>>> thing = ['Chrom1', '10000', '10600']
>>> c, start, stop = thing
>>> c
'Chrom1'
>>> start, stop
('10000', '10600')
>>>

slice objects can be used to index into a sequence:

>>> s = 'abcdefg'
>>> items_123 = slice(1, 4)
>>> s[items_123]
'bcd'
>>> s[1:4]
'bcd'
>>> 

start_stop is a list that contains slice objects which can be used extract the desired text from format2

chunks = [format2[chunk] for chunk in start_stop]

Expanded as a for loop:

chunks = []
for chunk in start_stop:
    chunks.append(format2[chunk])

You seem to have forgotten to close the second file. You can avoid that mistake by using files in a context manager using the with keyword:

with open('dna.txt') as f:
    fdna = f.read()

fdna = fdna.lower()
...
wwii
  • 23,232
  • 7
  • 37
  • 77
  • Thanks for this! I've cleaned up the parts you suggested (I'll update the original post after this) but I've still got some questions. First, when I "print list wanted_stuff" I get the right format, but when I just "print wanted_stuff" I get what you have shown. Is there a specific reason for that? I'm also still learning how to use generator expressions, so that is something I will try! I don't exactly understand the line of code for defining start_stop. How does it take the numbers and make them starts/stops? also, what is "c" and "a". since I'm looking for dna, i don't want to use atcg – SnarkShark Jul 27 '15 at 04:05
  • Also when I run it with this addition, I get an error. Edit: line 23 in start_stop = [slice(int(start), int(stop)) for c, start, stop in a] NameError: name 'a' is not defined. – SnarkShark Jul 27 '15 at 04:09
  • @SnarkShark, ```name 'a' is not defined.``` in my answer, the example data is assigned to ```a``` so that is what I used in the code - you need to adjust for the name of your data. [```slice```](https://docs.python.org/3/library/functions.html#slice) .. [slice object](https://docs.python.org/3/glossary.html#term-slice). you may also be interested in [```collections.Counter```](https://docs.python.org/3/library/collections.html#collections.Counter) – wwii Jul 27 '15 at 04:24
  • Unfortunantly, I still don't really understand that line of code and how, for example, I can map 10000 on to start and 106000 onto end. I read that line of code as, "the thing named start_stop is a pair of some integer named started and another integer named stop, and... ". I don't understand the "for c, start, stop in a". For my purposes would it be something like, "for chunk, start in start, stop in end"...but then I need to figure out how to loop it as well. – SnarkShark Jul 27 '15 at 04:35
  • Thanks for your update! I'm still stuck on getting a list of slice objects and confused about "c", since you its not defined in your code or mine. "a" would be the equivalent to "wanted_stuff", right. I get the error "ValueError: too many values to unpack." – SnarkShark Jul 28 '15 at 06:08
  • Edit: I cant put my code in the commends so I will edit my code above right now – SnarkShark Jul 28 '15 at 06:09
  • @SnarkShark - perhaps you should *play* with your updated code a bit and use some ```print``` statements/functions to take a look at the result of some of your processes then post another question concerning the problems with the updated code. Try to restrict the question(s) to specific problems and if possible create code-snippits that illustrate the problem you are having - [mcve](http://stackoverflow.com/help/mcve). You may want to start learning the [Python Debugger](https://docs.python.org/3/library/pdb.html) in case printing isn't sufficient. – wwii Jul 28 '15 at 15:54
  • Thanks for your advice and help. I will do that and hopefully I can figure it out based on the guidance you gave me. Thank you! – SnarkShark Jul 29 '15 at 01:51