Finding unique strings of characters in each sequence

Question

I am trying to create a program that has multiple sequences of tRNA stored as a dictionary. I have set up my code to extract and store the sequences and the specific names associated with the sequences as:

class Unique():
    def __init__(self, seq = ''):
        for s in range(len(seq)):
            for e in range(s + 1, len(seq) + 1):
                self.add(seq[s:e])
        self.head = head
        self.sequence = seq
        self.original = {}

    def cleaner(self):
        for (header, sequence) in myReader.readFasta():
            clean = sequence.replace('-','').replace('_','')
            self.original[self.head] = clean
        return self.original

    def sites(self):
        Unique.cleaner(self)

I am calling on the sites function (which is why it runs cleaner as the first step), but I am lost on how I can go about writing code to find unique strings in each stored sequence.

As an example if I have 2 sets of Sequences:

UCGUUAGC
AGCGCAUU

The program would be able to tell me that the first sequence's unique string is UCG and the second's is AGC, since UCG is ONLY present in the first sequence and AGC is only present in the second.

EDIT: What I mean by unique sequence: Any strand of the sequence I can see and automatically know which sequence it came from. So if the strand UCGA only exists in one sequence, it is counted and saved as a unique strand associated with that sequence.

The sequences extracted look like this:

GAGAGAGACAUAGAGGDUAUGAPGPPGG'UUGAACCAAUAGUAGGGGGUPCG"UUCCUUCCUUUCUUACCA

There are many unique sequences not named. You should clarify your definition of unique and a sequence. Is it always 3 characters? Can it start at any point? — Klaus D., Dec 07 '15 at 20:48
@KlausD. It doesn't necessarily have to be 3 characters, just any combination of characters that is unique to each sequence. I hope my edit clarifies what I meant — lamazibiji, Dec 07 '15 at 20:53
So, just starting with the U it would be `UC`, `UCG`, `UCGU`, `UCGUU`, `UCGUUA`, `UCGUUAG` and `UCGUUAGC` for sequence 1? — Klaus D., Dec 07 '15 at 20:56
@KlausD. - Think you missed `UA`, `UUA`, `UUAG`, and some others :) — OneCricketeer, Dec 07 '15 at 21:00
Overall yes, that is basically it, although I would rather have it spit out the answers as `UC`, `CGU`, `GUU`, etc., as they don't just add a letter or two to the first unique string found. But if I can't do that, that's fine too. — lamazibiji, Dec 07 '15 at 21:06
I just noticed `AGC` is in both of your sequences, so your example is a bit wrong. — OneCricketeer, Dec 07 '15 at 21:12

score 1 · Answer 1 · edited May 23 '17 at 10:27

So, if I understand correctly, you want all substrings of sequence A that do not exist in sequence B.

This can be easily achieved using a set complement or difference.

And I "stole" some code from another answer.

def get_all_substrings(input_string):
    # every slice input_string[i:j+1] with i <= j, i.e. all non-empty substrings
    length = len(input_string)
    return [input_string[i:j+1] for i in range(length) for j in range(i, length)]

# convert these to sets to remove duplicate substrings
seq1 = set(get_all_substrings('UCGUUAGC')) 
seq2 = set(get_all_substrings('AGCGCAUU'))

unique_seq1 = seq1 - seq2 # those sequences that are in seq1, and not in seq2
unique_seq2 = seq2 - seq1 # those sequences that are in seq2, and not in seq1

UPDATE: As pointed out in the comments, the get_all_substrings function I copied builds the full list of substrings in memory, which gets expensive for long strings. This version is more performant because it is a generator that yields each substring lazily:

def get_all_substrings(string):
    # generator: yields one substring at a time instead of building a list
    length = len(string)
    for i in range(length):
        for j in range(i + 1, length + 1):
            yield string[i:j]
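
For illustration, here is a minimal sketch of how the generator version can be combined with the set difference, assuming Python 3 (`range` instead of `xrange`). `set.difference` accepts any iterable, so the second sequence's substrings can be streamed straight from the generator instead of being collected into a second set first. The two sequences are the ones from the question.

```python
def get_all_substrings(string):
    """Lazily yield every non-empty substring of `string`."""
    length = len(string)
    for i in range(length):
        for j in range(i + 1, length + 1):
            yield string[i:j]


# Only the first sequence's substrings are materialised as a set.
seq1 = set(get_all_substrings('UCGUUAGC'))

# The generator is consumed one substring at a time by difference().
unique_seq1 = seq1.difference(get_all_substrings('AGCGCAUU'))

print('UCG' in unique_seq1)  # True: UCG never occurs in the second sequence
print('AGC' in unique_seq1)  # False: AGC occurs in both sequences
```

The second check matches the remark in the question's comments that `AGC` is present in both example sequences.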

answered Dec 07 '15 at 21:07

OneCricketeer

Thank you so much for your response! Just one question: would this work in an automated way? since I have way more than 2 sequences in my actual program. So when comparing seq1, it would compare it with like 50+ other sequences. – lamazibiji Dec 07 '15 at 21:14
@PadraicCunningham - the generator version is in that link to the other post, but I think having the list readily available is okay. And `a.difference(b)` is equivalent to `a - b` – OneCricketeer Dec 07 '15 at 21:18
@cricket_007, `a.difference(get_all_substrings('AGCGCAUU'))` is not the same; if you have large amounts of data your code won't fare too well. If you want to find what is unique to both, symmetric_difference would do – Padraic Cunningham Dec 07 '15 at 21:20
@PadraicCunningham - of course that isn't the same, `get_all_substrings` returns a list, not a set – OneCricketeer Dec 07 '15 at 21:21
@cricket_007 do you know how the method call actually works? You should maybe read the docs on python sets – Padraic Cunningham Dec 07 '15 at 21:23
@PadraicCunningham - yes :) It is a doubly-nested for loop with `O(n^2)` runtime that generates all substrings of the given string. The return value is a list that is stored in memory. I understand that using a generator is better and doing so would allow the use of `a.difference(get_all_substrings('AGCGCAUU'))`. Am I done being interviewed, now? – OneCricketeer Dec 07 '15 at 21:26
Thank you both for helping out. I'm trying to convert the manual code cricket provided so I can make it automated for multiple sequences. Hopefully I can get it done without a problem haha – lamazibiji Dec 07 '15 at 21:29
Using a list also allows `a.difference(....`. I think pointing out that your code can be optimized greatly is not considered to be an interview. You also only needed to use parens: `(input_string[i:j+1] for i in xrange(length) for j in xrange(i,length))` – Padraic Cunningham Dec 07 '15 at 21:30
@PadraicCunningham - Ah, didn't realize generator comprehension was exactly like list comprehension. Thanks for the tip! – OneCricketeer Dec 07 '15 at 21:39
@cricket_007 I tried to feed sequences into get_all_substrings using a for loop, but all the outputs of substrings are one letter long. Why is this happening? – lamazibiji Dec 07 '15 at 21:58
Not sure what you mean. Like you have a list of sequences you are iterating over and passing them into `get_all_substrings` only returns substrings of a single character? That could only happen if the sequence you give it is only one character. – OneCricketeer Dec 07 '15 at 22:08
I have all the sequences saved in a dictionary. I made a for loop so that, for every sequence in the dictionary, it runs get_all_substrings with that sequence, and then I told it to print the substrings. For each sequence it ran, it output 1 letter for the substrings. Does that make more sense? – lamazibiji Dec 07 '15 at 22:14
Should you have the sequences in a list? Not sure what your keys or values would be for your dictionary. Also make sure you are iterating over the strings rather than the characters in the strings. – OneCricketeer Dec 07 '15 at 22:24
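
The follow-up about 50+ sequences stored in a dictionary could be handled along these lines. This is only a sketch under assumptions: the dictionary maps names to cleaned sequence strings, the names `seq1`/`seq2` are made up, and `unique_substrings` and `shortest_only` are hypothetical helpers, not part of the answer above. Instead of subtracting every other set pairwise, it counts in how many sequences each substring occurs and keeps those seen in exactly one. `shortest_only` then addresses the wish for answers that do not merely extend a shorter unique string.

```python
from collections import Counter


def get_all_substrings(string):
    """Lazily yield every non-empty substring of `string`."""
    length = len(string)
    for i in range(length):
        for j in range(i + 1, length + 1):
            yield string[i:j]


def unique_substrings(sequences):
    """Map each name to the substrings found in that sequence and no other.

    `sequences` is assumed to be a dict of {name: sequence string}.
    """
    # One substring set per sequence, built from the whole string.
    subs = {name: set(get_all_substrings(seq)) for name, seq in sequences.items()}
    # Count how many sequences contain each substring.
    counts = Counter()
    for substrings in subs.values():
        counts.update(substrings)
    return {name: {s for s in substrings if counts[s] == 1}
            for name, substrings in subs.items()}


def shortest_only(unique):
    """Drop unique substrings that just extend a shorter unique one."""
    return {s for s in unique if s[1:] not in unique and s[:-1] not in unique}


# Hypothetical names; the sequences are the two from the question.
sequences = {'seq1': 'UCGUUAGC', 'seq2': 'AGCGCAUU'}
result = unique_substrings(sequences)

print(sorted(shortest_only(result['seq1'])))  # ['GU', 'UA', 'UC']
print(sorted(shortest_only(result['seq2'])))  # ['AU', 'CA', 'CGC', 'GCG']
```

Note that the loop passes each whole sequence string to `get_all_substrings` via `sequences.items()`. Calling it on the individual characters of a sequence would only ever produce one-letter substrings; whether that is what happened in the loop described above is a guess, since that code was not posted.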
