0

I have code which is given below:

for pileupcolumn in samfile.pileup(max_depth = 1000000) :
  X.append(pileupcolumn.n)
  for pileupread in pileupcolumn.pileups:
     if (pileupread.alignment.mapping_quality <= 15):
             continue      
     if not pileupread.is_del and not pileupread.is_refskip:
             if pileupread.alignment.query_qualities[pileupread.query_position] < 30:
             # Skip entries with base phred scores < 10
                  continue
             if pileupread.alignment.is_reverse: #negative
                  ReverseList[pileupcolumn.pos] += pileupread.alignment.query_sequence[pileupread.query_position]
             else:
                  ForwardList[pileupcolumn.pos] += pileupread.alignment.query_sequence[pileupread.query_position]

The above code is taking a lot of time and I want to replace concatenation in the 11th and 13th line with join. Is there any way to do so?

  • Is there any reason to believe that the concatenation is using a noticeable amount of time compared to everything else? Using `join` on that is not going to be straight-forward. – Mark Ransom Sep 29 '17 at 22:36
  • @MarkRansom Since this is bioinformatics, the strings are probably very long (DNA sequences), and repeated concatenation is very likely to be a bottleneck, since it's O(n^2). – Barmar Sep 29 '17 at 22:43
  • `join` is O(n), because it can allocate the final result string once, and then copy all the inputs into it. – Barmar Sep 29 '17 at 22:44

1 Answers1

1

Instead of concatenating, collect the values into a list, then join the list at the end of the loop.

for pileupcolumn in samfile.pileup(max_depth = 1000000) :
  X.append(pileupcolumn.n)
  forward = []
  reverse = []
  for pileupread in pileupcolumn.pileups:
     if (pileupread.alignment.mapping_quality <= 15):
             continue      
     if not pileupread.is_del and not pileupread.is_refskip:
             if pileupread.alignment.query_qualities[pileupread.query_position] < 30:
             # Skip entries with base phred scores < 10
                  continue
             if pileupread.alignment.is_reverse: #negative
                  reverse.append(pileupread.alignment.query_sequence[pileupread.query_position])
             else:
                  forward.append(pileupread.alignment.query_sequence[pileupread.query_position])
  ReverseList[pileupcolumn.pos] += ''.join(reverse)
  ForwardList[pileupcolumn.pos] += ''.join(forward)

I still use concatenation at the end, because this optimization only works for the for pileupread loop. If different pileupcolumn objects have the same pos, we need to concatenate at that point. We also need this concatenation if the ReverseList elements already have values before this code runs.

Barmar
  • 741,623
  • 53
  • 500
  • 612
  • your solution is not working it is giving following error: File "getbases.py", line 33, in ForwardList,ReverseList = getbases(bamfile) File "getbases.py", line 25, in getbases ReverseList[pileupcolumn.pos] +=reverse.join() IndexError: list index out of range – Ammar Sabir Cheema Oct 05 '17 at 06:14
  • That's a problem with the part of the script you didn't show, where `ReverseList` and `ForwardList` are initialized. – Barmar Oct 05 '17 at 15:33
  • The error means that `pileupcolumn.pos` is bigger than the length of `ReverseList` – Barmar Oct 06 '17 at 19:48
  • Sorry for that, the two lists i.e ReverseList and ForwardList are initialized with the length of the reference which is can be equal to 16569 or 16571 in following way: ReverseList =[' '] * lenref and ForwardList = [' '] *lenref – Ammar Sabir Cheema Oct 06 '17 at 19:50
  • And what's the value of `pileupcolumn.pos` when the error happens? – Barmar Oct 06 '17 at 19:51
  • My code is assigning to the same element of ReverseList that your code does. I don't see how it could get an error when yours doesn't. – Barmar Oct 06 '17 at 19:58
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/156125/discussion-between-ammar-sabir-cheema-and-barmar). – Ammar Sabir Cheema Oct 06 '17 at 20:00
  • So previously I was using your solution by creating empty ReverseList and ForwardList but after when I initiated lists with length 16569 as mentioned above: ReverseList =[' '] * lenref and ForwardList = [' '] *lenref then I am getting following error : ReverseList[pileupcolumn.pos] +=reverse.join() AttributeError: 'list' object has no attribute 'join' – Ammar Sabir Cheema Oct 06 '17 at 20:20
  • Sorry, I was using Javascript join syntax, not Python. – Barmar Oct 06 '17 at 20:42
  • So is there any solution to this problem? – Ammar Sabir Cheema Oct 07 '17 at 02:34
  • What's wrong with my solution once I fixed the incorrect way of calling `join()`? – Barmar Oct 07 '17 at 06:01
  • If this solution works then there is no issue but its not working – Ammar Sabir Cheema Oct 07 '17 at 06:55
  • What problem is it having now? That should have fixed the error about `list object has no attribute join` – Barmar Oct 07 '17 at 06:56
  • OK let me check it again – Ammar Sabir Cheema Oct 07 '17 at 06:58
  • I tried to run it , but the system get hanged after running the above solution. – Ammar Sabir Cheema Oct 08 '17 at 07:16
  • I don't see how this can hang, there are no potential infinite loops. But there's no way for me to test it myself. Add print statements so you can see what it's doing. – Barmar Oct 09 '17 at 19:10