1

I'm trying to build an array where each row contains k-mers (k length nucleotide strings) from a different sequence. I've been reading that you can't really have empty arrays and it's been difficult for me to try using append.

bases = ['A', 'T', 'C', 'G']
self.profile = np.array([])

    for x in range(1):
        k = self.ksize
        kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
        for i in range(0, len(self.motifs)):
            for q in range(0, len(kmer)):
                if kmer[q] in self.motifs[i]:
                    self.kmers.append(kmer[q])
                    self.profile[i] = self.kmers

The error I get here is: "IndexError: index 0 is out of bounds for axis 0 with size 0"

I realize that this is because I did not specify the shape of the array, but I only know the number of rows there will be, I don't know how many columns there will be (column size depends on how many k-mers are found in each sequence).

If I try to make it a 'list of lists':

bases = ['A', 'T', 'C', 'G']
    self.profile = list()

    for x in range(1):
        k = self.ksize
        kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
        for i in range(0, len(self.motifs)):
            for q in range(0, len(kmer)):
                if kmer[q] in self.motifs[i]:
                    self.kmers.append(kmer[q])
                    self.profile[i] = self.kmers

I just get: self.profile[i] = self.kmers IndexError: list assignment index out of range

Is there a better way to do this?

eyllanesc
  • 235,170
  • 19
  • 170
  • 241
asttra
  • 179
  • 9
  • 1
    Just a note: range starts byu default at 0, so `reange(0, len(kmer))` is exactly the same as `range(len(kamer))`. – Puff Oct 16 '18 at 03:23
  • It's not quite clear what you want to archive. Your code as presented here does not reproduce the error, because, presumably, you don't have the class defined. Take a look at [mcve]. Regardless, if the arrays you are inputing into `self.profile` are not all the same length, numpy isn't what you want: [https://stackoverflow.com/questions/3386259/how-to-make-a-multidimension-numpy-array-with-a-varying-row-size]. – Puff Oct 16 '18 at 03:37
  • 1
    If I understand your code correctly, and it does what you think it does, you should be good just replacing `self.profile[i] = self.kmers` with `self.profile.append(self.kmers)`. This way, `self.profile` will contain a list of the sequences of lenght k and shorter that contain the bases specified in `self.motifs`, but not in the order given in `self.motifs`. If this is what you want, I'll post an answer with minor corrections to your code. – Puff Oct 16 '18 at 03:49
  • @MarcosWappner the reason why I wanted self.profile[i] = self.kmers is that I wanted a list of k-mers from each sequence in self.motif. If I used self.profile.append(self.kmers) it concatenates them. – asttra Oct 16 '18 at 04:10
  • @MarcosWappner, if it helps... self.motif looks like this: ['AAAAAACAGACGAAAAACTTAAAAACCCACAAAAACAATCAAAACCAGCC' 'AAACCTACAAAAAACTTAAAAACCCACAAAAACCAACAAAACCAGCCCCA' 'CAAAAAGCAGACGAAAAACTTAAAAACCCACAAAAACCATCAAAACCAGC' 'TACACAAGAAAAACTTAAAAACCCACAAAAAACCATGACCAGGAACCCCA' 'CGAAACATTTATAAAAGAGCAAAAACCAAAATCCCAAATAAAAACCCACA' 'CGAAACATTTATAAAAGAGCAAAAACCAAAATCCCAAATAAAAACCCACA' 'CGAAACATTTATAAAAGAGCAAAAACTAAGATCCCAAATAAAAACCCACA' 'AAAAAACAGACGAAAAACTTAAAAACACACAAAAACCGCAAAAACCAGCC'] – asttra Oct 16 '18 at 04:12
  • @MarcosWappner I want self.profile to be a list of lists containing k-mers from each 50 nucleotide string. – asttra Oct 16 '18 at 04:13
  • 1
    You won't be concatenating them because self.kmer is a list (which you should initialize somewhere). Or rather, I assume it is a list from the fact that you do `self.kmer.append(kmer[q])`. Lists can contain any type of oject and mix and match as you want. self.kmers will be a (growing) list and self.profile will contain all the stages of that list. – Puff Oct 16 '18 at 04:14
  • @MarcosWappner yes, self.kmer is a list that I initialized up top in my code. I just looked at my output again and you're right. They are not being concatenated. Thank you. – asttra Oct 16 '18 at 04:17

2 Answers2

2

Gathering info from the comments, I think what you want is the following: given a list of motifs (in your case, nucleotide strings each 50 bases long), you want the sub-sequences (k-mers) of lenght k that appear in each one. The more pythonic way to write your code would be:

bases = ['A', 'T', 'C', 'G']
self.profile = []

k = self.ksize
kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]

for mot in self.motifs:
   for km in kmer:
      if km in mot:
         self.kmers.append(km)
         self.profile.append(self.kmers)

Note that in python you don't need to loop over the indexes if you are only going to use it to access lists, arrays, or any iterable; you can just loop over the iterable itself. Check zip and enumerate for more flexibility.

One last thing: note that self.kmer will be a list containging [kmer1, kmer2, kmer4, kmer6] and so on (i.e., the kmers in yor motif), but you won't be able to discriminate between motifs. Also, self.profile will be a list of lists containing [[kmer1], [kmer1, kmer2], [kmer1, kmer2, kmer4]], and so on.

If you don't care about self.profile (because you can actully build it later), you can do everything with a big list comprehension:

kmers = [km for mot in motifs for km in kmer if km in mot]

EDIT: two extra things

Note that this way, kmers will have repeated sequences. To avoid that you can either write an extra cheack (if km not in self.kmers), or rather use sets, to avoid repetition.

If you do want the list of kmers separated by motifs, you can do it in a simpler way with list comprehension:

self.profile = [] 
for mot in motifs:
    individual_km = [km for km in kmer if km in mot]
    self.profile.append(individual_km)
Puff
  • 503
  • 1
  • 4
  • 14
  • How would I write it so that I could do this whole process like: In the first nucleotide of self.motifs, find the k-mers, store k-mers in list, put that list in another list.... and then do this process again for the next nucleotide string in the self.motifs-- appending each list along the way? This way my data is corresponding (i.e. self.motifs[0] goes with self.kmers[0] which goes with self.profile[0])? – asttra Oct 16 '18 at 05:01
  • 1
    If I understand correctly your question, see the edited answer. However, I'm not sure what you want self.profile to contain. – Puff Oct 16 '18 at 05:05
  • I want self.profile to be a list of lists which contain the k-mers found in each nucleotide string from self.motifs. ie. self.profile = [ [k-mers from self.motif[0] ], [k-mers from self.motif[1] ], [k-mers from self.motif[ 2] ],.... etc ] – asttra Oct 16 '18 at 05:08
  • So when I tried your code, I got that the length of self.motifs was different than the length of self.profile. There are 8 strings in self.motifs, so there should be only 8 lists of k-mers within self.profile. Advise? – asttra Oct 16 '18 at 05:24
  • 1
    I do get an list of lenth 8 for `self.profile`. Did you make sure to properly intialize with an empty list? If you are using spyder or something like that, try restarting the kernel. – Puff Oct 16 '18 at 05:27
  • 1
    Yes, I initialized with an empty list. I'm using PyCharm and I saved my changes just be sure. Still getting different numbers. My output for len(self.profile) is like 312. edit: I think this might be an issue with my code somewhere else. Thank you for your help. – asttra Oct 16 '18 at 05:36
  • Just back to say I fixed my issue and now it works wonderfully. Thank you! – asttra Oct 16 '18 at 06:05
  • 1
    Cool. I hope you not only solved your issue, but also learned some python. – Puff Oct 16 '18 at 06:17
1

Numpy arrays are not well equipped to grow dynamically like python lists and dictionaries. If fact, last I read, to arbitrarily grow Numpy arrays, a new array is created to the desired shape, and then a copy is made from the original array object, which is not overly optimal.

To achieve the results you're after, I've had to create a nested list objects first, and then create the Numpy array all at once after the iterations are complete. As longs as the size of the nested list objects are equal, you can then just use something like:

my_profile = []

... your looping code ...

self.profile = np.array(my_profile)