
I have this for loop:

import re
from gevent import queue  # per the edit below, this is Gevent's queue

people = queue.Queue()
for person in set(list_):
    first_name, last_name = re.split(',| | ', person)
    people.put([first_name, last_name])

The list being iterated over has 1,000,000+ items. It works, but takes a couple of seconds to complete.

What changes can I make to help the processing speed?

Edit: I should add that this is Gevent's queue library

mikeyy

4 Answers


The question is what your queue is being used for. If it isn't really necessary for threading purposes (or you can work around the threaded access), then in this kind of situation you want to switch to generators; you can think of them as the Python version of Unix shell pipes. So your loop would look like:

import re

def generate_people(list_):
    previous_row = None
    for person in sorted(list_):
        if person == previous_row:
            continue
        first_name, last_name = re.split(',| | ', person)
        yield [first_name, last_name]
        previous_row = person

and you would use this generator like this:

for first_name, last_name in generate_people(list_):
    print first_name, last_name

This approach avoids what is probably your biggest performance hit: allocating memory to build a queue and a set with 1,000,000+ items in them. It works with one pair of strings at a time instead.
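As a self-contained sketch of the idea (the sample names and the simplified `',| '` pattern are assumptions, not from the question):

```python
import re

def generate_people(list_):
    # Sorting groups duplicate rows together, so remembering only the
    # previous row is enough to skip repeats.
    previous_row = None
    for person in sorted(list_):
        if person == previous_row:
            continue
        first_name, last_name = re.split(',| ', person)
        yield [first_name, last_name]
        previous_row = person

names = ["John,Smith", "Jane Doe", "John,Smith"]  # assumed sample data
pairs = list(generate_people(names))
```

Nothing is materialized up front; each pair is produced only when the consumer asks for it.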

UPDATE

Based on more information about how threads play a role in this, I'd use this solution instead.

people = queue.Queue()
previous_row = None
for person in sorted(list_):
    if person == previous_row:
        continue
    first_name, last_name = re.split(',| | ', person)
    people.put([first_name, last_name])
    previous_row = person

This replaces the set() operation with something that should be more efficient.
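A tiny sketch (with assumed sample data) showing that the sorted-plus-previous-row pass keeps exactly the unique rows that set() would:

```python
list_ = ["Smith,John", "Doe,Jane", "Smith,John"]  # assumed sample data

deduped = []
previous_row = None
for person in sorted(list_):
    # After sorting, duplicates are adjacent, so comparing against the
    # previous row is enough to drop them.
    if person == previous_row:
        continue
    deduped.append(person)
    previous_row = person
```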

David K. Hess
  • I'm using the queue for threading since it's thread safe. I'm not sure your way of splitting will work since the regex code I have in place is used to split between multiple delimiters. I will give the approach as a generator and see if that helps. Thanks. – mikeyy Dec 05 '11 at 03:41
  • I didn't change anything about the split. Just reworked the function into a generator. – David K. Hess Dec 05 '11 at 03:42
  • Oh sorry, I must have read another comment with the split being changed.. weird, my apologies. – mikeyy Dec 05 '11 at 03:43
  • Is a thread pulling from this queue as you are adding to it? If so, then the set operation may be the real performance hit here. – David K. Hess Dec 05 '11 at 03:45
  • Nope, I add everything to the queue once then run through it. – mikeyy Dec 05 '11 at 03:47
  • So, are threads involved? It sounds like you don't need them if that's what you are doing. If you don't need them, I would ditch the queue and move to a generator. – David K. Hess Dec 05 '11 at 03:50
  • Threading is involved, I get a name on each thread and I use queues so I don't pull the same name multiple times since it's thread safe. – mikeyy Dec 05 '11 at 03:55
  • What's the previous_row doing? – mikeyy Dec 05 '11 at 04:08
  • It is used to perform the set operation. Creating a set from an iterable can be simulated by sorting the iterable and throwing away duplicate rows. – David K. Hess Dec 05 '11 at 04:10
  • Oh I see now. First I've seen it done like that, neat way to do it. – mikeyy Dec 05 '11 at 04:13
with people.mutex:
    people.queue.extend(list(re.split(',| | ',person)) for person in set(list_))
    people.not_empty.notify_all()

Note that this completely ignores the queue capacity, but avoids lots of excessive locking.
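For reference, a runnable sketch of the same bulk-extend trick against the standard library's queue.Queue (the sample data and the simplified pattern are assumptions; `mutex`, `queue`, and `not_empty` are internal implementation details of that class):

```python
import queue
import re

people = queue.Queue()
names = {"John,Smith", "Jane Doe"}  # assumed sample data

# One lock acquisition for the whole batch instead of one per put().
with people.mutex:
    people.queue.extend(re.split(',| ', person) for person in names)
    people.not_empty.notify_all()
```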

Matt Joiner

I would try replacing the regex with something a bit less expensive:

first_name, last_name = person.split(', ')
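If the rows can contain either a comma or a space (the question's pattern splits on both), one regex-free way to handle that, assuming the sample input below, is to normalize the comma to a space first:

```python
person = "John,Smith"  # assumed sample input

# Turn the comma into a space, then split on whitespace; handles
# "John,Smith", "John Smith", and "John, Smith" the same way.
first_name, last_name = person.replace(',', ' ').split()
```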
Blender

I think you can use multiple threads to read the data, with a concurrent (thread-safe) queue between them.
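A minimal sketch of that pattern with the standard library (the sample data and the worker count are assumptions):

```python
import queue
import threading

# Fill a thread-safe queue, then let several workers drain it.
people = queue.Queue()
for name in ["John,Smith", "Jane,Doe"]:  # assumed sample data
    people.put(name)

results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            name = people.get_nowait()
        except queue.Empty:
            return  # queue drained, worker exits
        with results_lock:
            results.append(name)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each item is handed to exactly one worker, which is the property the asker wants from a thread-safe queue.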

ttyunix