2

I have a list of strings. I want to get all items in the list that start with the character '&' and store them in a different list. I tried looping over each item, but found that it is slow, especially when the list is very large. Are there other ways to do it?

Here is what I did:

myList=['& apple','^ banana','& pineapple']
newList=[]

for v in myList:
    if v.startswith('&'):
        newList.append(v)

Thanks for reading.

Ivan Kolesnikov

4 Answers

5

There is a shorter way which is not necessarily faster: a list comprehension.

myList = ['& apple','^ banana','& pineapple']
newList = [v for v in myList if v.startswith('&')]

If you are doing lots of data processing, pandas may be faster:

import pandas
df = pandas.DataFrame({'myList': myList})
# Select the matching rows with a boolean mask, then convert back to a plain list
newList = df.myList[df.myList.str.startswith('&')].tolist()

Finally, if this is too slow, it may be useful to do the categorization while you are building the first list. If it's very large, you're probably reading it from a file or doing computations, so you can build newList in that loop.
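A minimal sketch of that last idea, assuming the strings come from a line-oriented source. Here `io.StringIO` stands in for a real file, and the names `source`, `all_items`, and `amp_items` are hypothetical:

```python
import io

# Stand-in for a real file; in practice this might be an opened text file.
source = io.StringIO("& apple\n^ banana\n& pineapple\n")

all_items = []   # the full list
amp_items = []   # items starting with '&', collected in the same pass

for line in source:
    item = line.rstrip("\n")
    all_items.append(item)
    if item.startswith("&"):
        amp_items.append(item)

print(amp_items)
```

This does not make the check itself cheaper; it only avoids a second pass over the data.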

chthonicdaemon
  • 1
    This is an example of a perfect answer. Just great, thanks for sharing :) – developer_hatch May 11 '17 at 04:47
  • 1
    Using a generator won't protect you from having to iterate through all the elements. So if you use a couple of generators and iterate through them at the same time it can be faster than using three list comprehensions because you save the time of iterating, but iterating through a generator is not faster than iterating through a list. – chthonicdaemon May 11 '17 at 05:13
  • 1
    That syntax isn't quite right.... shouldn't the `if` be after the `for`? – tdelaney May 11 '17 at 05:29
4

Let's put the proposed solutions to the test, and add my own contribution which, blush blush, handily beats the rest. This test script is organized from slowest to fastest. I use a list that is 50% matches. The original and yield cases perform worse as the percentage of matches rises.

from timeit import timeit

# generate test with half matching
myList = ['&abc', 'def'] * 10000

def orig():
    newList = []
    for v in myList:
        if v.startswith('&'):
            newList.append(v)

print('orig')
print(timeit("orig()", setup="from __main__ import orig", number=5000))

def doyield():
    def inner(myList):
        for v in myList:
            if v.startswith('&'):
                yield v
    newList = list(inner(myList))

print('yield')
print(timeit("doyield()", setup="from __main__ import doyield", number=5000))

def comp1():
    newList = [v for v in myList if v.startswith('&')]

print('comprehension with startswith')
print(timeit("comp1()", setup="from __main__ import comp1", number=5000))

def comp2():
    newList = [v for v in myList if v and v[0]=='&']

print('comprehension with compare')
print(timeit("comp2()", setup="from __main__ import comp2", number=5000))

On my junky laptop (total seconds for 5000 runs of each):

orig
55.8570241928
yield
50.6004090309
comprehension with startswith
47.4232199192
comprehension with compare
24.5065619946
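For a single-character prefix, the two comprehension tests select the same items. The `v and` part guards against empty strings, where `v[0]` would raise `IndexError`. A quick check on a hypothetical sample list (`sample` is not part of the benchmark above):

```python
sample = ['&abc', 'def', '', '&', 'a&b']

with_startswith = [v for v in sample if v.startswith('&')]
with_compare = [v for v in sample if v and v[0] == '&']

# Both forms pick the same items for a one-character prefix.
print(with_startswith == with_compare)
```

For prefixes longer than one character the index comparison no longer applies, so `startswith` (or a slice comparison) would be needed.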

UPDATE

Let's throw the Cython compiler at the problem and see what happens. Note that you have to install Cython, which is easy on systems that have it packaged; otherwise check out http://cython.org. I moved the test functions into a separate module and have my new test module compile it on import. There are other ways to compile Cython modules (see the setup.py info on the Cython web site), but this way is easy for this test. The downside is that the code compiles during the first run (you need to wait a while) and may error out.

Interestingly, the original code runs much better, but the 4th (comprehension with compare) is fastest by far.

test2.py

from timeit import timeit

# generate test with half matching
myList = ['&abc', 'def'] * 10000

def orig():
    newList = []
    for v in myList:
        if v.startswith('&'):
            newList.append(v)

def doyield():
    def inner(myList):
        for v in myList:
            if v.startswith('&'):
                yield v
    newList = list(inner(myList))

def comp1():
    newList = [v for v in myList if v.startswith('&')]

def comp2():
    newList = [v for v in myList if v and v[0]=='&']

cytest.py

import pyximport
pyximport.install(pyimport = True)
from test2 import *
from timeit import timeit


print('orig')
print(timeit("orig()", setup="from __main__ import orig", number=5000))

print('yield')
print(timeit("doyield()", setup="from __main__ import doyield", number=5000))

print('comprehension with startswith')
print(timeit("comp1()", setup="from __main__ import comp1", number=5000))

print('comprehension with compare')
print(timeit("comp2()", setup="from __main__ import comp2", number=5000))

Running it I get

orig
29.2698290348
yield
31.7977068424
comprehension with startswith
29.41690588
comprehension with compare
4.69118189812
tdelaney
2

You could yield the items instead of appending them to a new list, like this:

myList=['& apple','^ banana','& pineapple']

def generates():
    for v in myList:
        if v.startswith('&'):
            yield v

then you can do things with the generator, like:

mygenerator = generates()
for i in mygenerator:
    pass  # do something with i
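The same lazy filtering can also be written inline as a generator expression. A small sketch (the name `lazy` is hypothetical):

```python
myList = ['& apple', '^ banana', '& pineapple']

# Nothing is filtered until the generator is consumed.
lazy = (v for v in myList if v.startswith('&'))

for item in lazy:
    print(item)

# A generator can only be consumed once; build a list if you need to reuse the results.
newList = list(v for v in myList if v.startswith('&'))
```

Either way, every element of `myList` is still visited once; the generator only avoids building the intermediate list.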
developer_hatch
2

Assuming that there are no single quotes (') in the strings, and that '&' only ever appears at the start of a string ...

from re import findall

myList=['& apple','^ banana','& pineapple']

newList=findall(r'&[^\']*',str(myList))
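To see what this matches, it helps to look at the string that `str(myList)` produces. The sketch below also shows a caveat: the pattern matches '&' anywhere in an item, not only at the start (the list `tricky` is a hypothetical example):

```python
from re import findall

myList = ['& apple', '^ banana', '& pineapple']

# str() renders the list as one string with each item in single quotes.
print(str(myList))

newList = findall(r"&[^']*", str(myList))
print(newList)

# An '&' in the middle of an item is matched as well.
tricky = ['a & b', '& c']
tricky_matches = findall(r"&[^']*", str(tricky))
print(tricky_matches)
```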