
I have a list called:

FirstSequenceToSplit

and it contains one item, which is a DNA sequence say:

'ATTTTACGTA'

I can easily return the length of this item, so the user knows that it is 10 characters long. I then want the user to specify a slice to extract, say [0:6], and to produce two items in a new list. The first item should contain the characters in the user-defined slice, with question marks replacing the characters that weren't extracted; the second item should contain the inverse.

So to illustrate what I want, if the user said they wanted [0:5] you would get a new list with the following items:

['ATTTT?????','?????ACGTA']

This is all part of a much larger problem: I have a set of DNA sequences in FASTA format ('>Sequence1\nATTTTACGTA', '>Sequence2\nATTGCACGTA', etc.), and I want the user to be able to choose a sequence by its ID and have that sequence split at the chosen indices, with the two results named by adding 'a' and 'b' to the ID, e.g. Sequence1a and Sequence1b ('>Sequence1a\n?????ACGTA', '>Sequence1b\nATTTT?????', '>Sequence2\nATTGCACGTA', etc.). So far I print the names of the sequences, let the user choose one to splice, and extract just the sequence (without the ID). Once I solve the issue shown above, I will create a new list with the new items.

As I am a beginner (as I am sure is obvious by now!), I would appreciate explanations of any code given. Thank you so much for any help you can give.

My code so far is:

import sys
import re

#Creating format so more user friendly

class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'


fileName = raw_input("Give the name of the Fasta file you wish to divide up  ")
# e.g. TopTenFasta

#Reading in the sequences splitting them by the > symbol
in_file = open(fileName,"r")
sequences = in_file.read().split('>')[1:] 
in_file.close() 


#Putting all these sequences into a list
allSequences = []
for item in sequences:
    allSequences.append(item)

#Letting you know how many sequences there are in total
NumberOfSequences = len(allSequences)
print color.BOLD + "The Number of Sequences in this list is: " +color.END, NumberOfSequences

#Returning the names of the IDs to allow you to decide which ones to split
SequenceIds = []
for x in allSequences:
    SequenceIds.append(x[0:10])

print color.BOLD + "With the following names: " + color.END, "\n", "\n".join(SequenceIds)

#-----------------------Starting the Splice ------------------------------------
#-----------------------------------------------------------------------------
#------------------------------------------------------------------------------



#Choosing the sequence you wish to splice 
FirstSequenceToSplitID = raw_input(color.BOLD + "Which sequence would you like to splice  " + color.END)

#Seeing whether that item is in the list
for x in SequenceIds:
    if FirstSequenceToSplitID == x:
        print "valid input"

FirstSequenceToSplit = []

#making a new list (FirstSequenceToSplit) and putting into it just the sequence (no ID)
for listItem in allSequences:
    if listItem[0:10]==FirstSequenceToSplitID:
        FirstSequenceToSplit.append(listItem[11:])

#Printing the Length of the sequence to splice
for element in FirstSequenceToSplit:
    print color.BOLD + "The Length of this sequence is" + color.END, len(element)
PaulBarr
  • Hi, I didn't include my code as I wanted to focus on the main question of my post, but I have edited it in to show what I've done so far. It is probably very long-winded as I have only started coding recently, so apologies! – PaulBarr Apr 10 '14 at 18:11

2 Answers


I would use comprehensions and zip. I've commented the code, but feel free to ask if something is unclear.

my_str = 'ATTTTACGTA'

# This loop will check that
#  - the casting to int is ok
#  - exactly two numbers are entered
#  - stop >= start
#  - start >= 0
#  - stop <= len(my_str)
while True:
    try:
        start, stop = map(int, raw_input(
            'Please enter start and stop index separated by whitespace\n').split())
        if stop < start or start < 0 or stop > len(my_str):
            raise ValueError
        break
    except ValueError:
        print 'Bad input, try again'


# Loop over all chars, check if the current index is inside range(start, stop).
# If it is, add (char, '?') to the array, if not, add ('?', char) to the array.
#
# For example, with start=2 and stop=4 this gives a list like this:
# [('?', 'A'), ('?', 'T'), ('T', '?'), ('T', '?'), ('?', 'T'), ('?', 'A'),
#  ('?', 'C'), ('?', 'G'), ('?', 'T'), ('?', 'A')]
#
# By using zip(*array), we unpack the pairs and collect the first elements into
# one tuple and the second elements into another, giving you a list like this:
#
# [('?', '?', 'T', 'T', '?', '?', '?', '?', '?', '?'),
#  ('A', 'T', '?', '?', 'T', 'A', 'C', 'G', 'T', 'A')]

chars = zip(*((c, '?') if start <= i < stop else ('?', c)
              for i, c in enumerate(my_str)))

# ''.join is used to concatenate the chars of each tuple into a string
my_lst = [''.join(s) for s in chars]
print my_lst

Sample output:

Please enter start and stop index separated by whitespace
4
Bad input, try again
Please enter start and stop index separated by whitespace
5 4
Bad input, try again
Please enter start and stop index separated by whitespace
e 3
Bad input, try again
Please enter start and stop index separated by whitespace
4 5
['????T?????', 'ATTT?ACGTA']
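
If the zip(*...) step is hard to follow, here is a minimal sketch of the same idea on a shorter string, with the intermediate list of pairs kept in a variable. The helper name mask_pairs is my own and is not part of the code above; the snippet should run under both Python 2 and 3.

```python
# Hypothetical helper (not from the answer's code): build one
# (inside, outside) pair per character of the sequence.
def mask_pairs(seq, start, stop):
    return [(c, '?') if start <= i < stop else ('?', c)
            for i, c in enumerate(seq)]

pairs = mask_pairs('ACGT', 1, 3)
# pairs is [('?', 'A'), ('C', '?'), ('G', '?'), ('?', 'T')]

# zip(*pairs) regroups the pairs: all first elements together,
# then all second elements together. ''.join turns each group into a string.
inside, outside = [''.join(chars) for chars in zip(*pairs)]
# inside is '?CG?' and outside is 'A??T'
```

The first element of each pair is the character when the index is inside the slice, so the first joined string keeps the extracted part and the second keeps everything else.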
Steinar Lima
  • Thank you so much, I am going to implement this into my code and will ask once I have done so. Thank you for your clear help! It helps me learn as well as solve my problems! – PaulBarr Apr 10 '14 at 18:31
  • I have been trying to get it to work, but at the moment I am getting a list with ['ATTTTACGTA', '?'] regardless of my inputs. I also had to change the check `if stop < start or start < 0 or stop > len(my_str): raise ValueError`, as the length of the list is 1 (1 item). Does it matter that I'm doing this on a list rather than a string? – PaulBarr Apr 10 '14 at 21:21
  • I solved this by saving the list item as a string so the code wasn't working on the list itself, thank you! – PaulBarr Apr 10 '14 at 21:29
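
To illustrate the list-versus-string issue from the comments above, here is a small sketch. The function name mask is hypothetical and simply wraps the comprehension from the answer; FirstSequenceToSplit is the one-item list from the question. It should run under both Python 2 and 3.

```python
# Hypothetical wrapper around the answer's comprehension.
def mask(seq, start, stop):
    chars = zip(*((c, '?') if start <= i < stop else ('?', c)
                  for i, c in enumerate(seq)))
    return [''.join(s) for s in chars]

FirstSequenceToSplit = ['ATTTTACGTA']

# Passing the list: enumerate sees ONE element (the whole string at index 0),
# so the "character" being masked is the entire sequence.
wrong = mask(FirstSequenceToSplit, 0, 5)
# wrong is ['ATTTTACGTA', '?']

# Passing the string stored in the list: enumerate sees ten characters.
right = mask(FirstSequenceToSplit[0], 0, 5)
# right is ['ATTTT?????', '?????ACGTA']
```

The same reasoning explains why len() returned 1: it was counting the items in the list, not the characters in the sequence.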

This expression will work:

[ c[0:n] + '?' * (len(c)-n), '?' * n + c[n:] ]
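
As a minimal sketch, the expression can be wrapped in a function so it is easier to reuse and test. The name split_at and the range check are my additions, not part of the expression above; c is the sequence string and n is the index to split at.

```python
def split_at(c, n):
    """Split string c at index n, padding each half with '?'."""
    # Assumption: n must lie within the string; negative values would
    # otherwise be read as "count from the end" by the slices.
    if not 0 <= n <= len(c):
        raise ValueError('n must be between 0 and len(c)')
    first = c[0:n] + '?' * (len(c) - n)   # keep c[:n], mask the rest
    second = '?' * n + c[n:]              # mask c[:n], keep the rest
    return [first, second]

result = split_at('ATTTTACGTA', 5)
# result is ['ATTTT?????', '?????ACGTA']
```

Multiplying a string by an integer repeats it, so '?' * 5 is '?????'; that is all the padding relies on.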
Michael Lorton
  • @PaulBarr Just to clarify `c` is your string, `n` is the index you want to split at. – photoionized Apr 10 '14 at 18:10
  • Thank you, is there any way that I could split between two indexes, such as [2:6] for example? – PaulBarr Apr 10 '14 at 18:13
  • @PaulBarr assuming you want three strings then, modify the above like this: `[ c[0:n] + '?' * (len(c)-n), '?' * n + c[n:m] + '?' * (len(c)-m), '?' * m + c[m:] ]`. It's pretty simple math. – photoionized Apr 10 '14 at 18:17
  • instead of '?' * (len(c)-n), you could do something like the following: `"{0:09d}".format(str)` – AdriVelaz Apr 10 '14 at 18:17
  • I mean still with two strings, i.e. having ['???TT?????','ATT??ACGTA']. Apologies for the basic questions, but I haven't been able to solve this by myself so far – PaulBarr Apr 10 '14 at 18:20
  • @PaulBarr Ok, that's just the middle expression in the above comment then: `'?' * n + c[n:m] + '?' * (len(c)-m)`. It's essentially just making a slice of the string between indexes n and m and then adding the proper number of '?' to the string for padding. – photoionized Apr 10 '14 at 18:22
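
Putting the comments above together, here is a sketch of the two-index, two-string version, plus one possible way to build the renamed FASTA records from the question. The function names split_between and split_record are hypothetical, and which half gets the 'a' or 'b' suffix is my assumption; swap them if you need the other order. It should run under both Python 2 and 3.

```python
def split_between(c, n, m):
    """Return [inside, outside] for the slice c[n:m], padded with '?'."""
    if not 0 <= n <= m <= len(c):
        raise ValueError('need 0 <= n <= m <= len(c)')
    # Keep only c[n:m]; mask everything before and after it.
    inside = '?' * n + c[n:m] + '?' * (len(c) - m)
    # The inverse: mask c[n:m]; keep everything before and after it.
    outside = c[:n] + '?' * (m - n) + c[m:]
    return [inside, outside]

def split_record(seq_id, seq, n, m):
    """Build two FASTA-style records from one sequence.

    Assumed naming: suffix 'a' for the extracted part, 'b' for the rest.
    """
    inside, outside = split_between(seq, n, m)
    return ['>%sa\n%s' % (seq_id, inside), '>%sb\n%s' % (seq_id, outside)]

two_strings = split_between('ATTTTACGTA', 3, 5)
# two_strings is ['???TT?????', 'ATT??ACGTA']

records = split_record('Sequence1', 'ATTTTACGTA', 0, 5)
# records is ['>Sequence1a\nATTTT?????', '>Sequence1b\n?????ACGTA']
```

Plain slicing does the same job as the zip-based answer; which one reads better is mostly a matter of taste.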