
I am trying to increase the number of requests per second. I am currently running Python 2.7 and able to get approximately 1 request per second. Do I need to multi-thread / multi-process the function, or asynchronously run multiple instances of it? I have no idea how to make this work. Please help :-)

import requests

while True:
    r = requests.post(url, allow_redirects=False, data={
        str(formDataNameLogin): username,
        str(formDataNamePass): password,
    })

    print 'Sending username: %s with password %s' % (username, password)
Raoslaw Szamszur
Ewy

2 Answers


Just use any async library. I think an asynchronous version of requests, such as grequests, txrequests, requests-futures or requests-threads, would work best for you. Below is a code sample from the grequests readme:

import grequests

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://fakedomain/',
    'http://kennethreitz.com'
]

Create a set of unsent Requests:

rs = (grequests.get(u) for u in urls)

Send them all at the same time:

grequests.map(rs)

Using or learning the other modules mentioned, say requests-threads, might be slightly more involved, especially with Python 2:

from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react
from requests_threads import AsyncSession

session = AsyncSession(n=100)

@inlineCallbacks
def main(reactor):
    responses = []
    for i in range(100):
        responses.append(session.get('http://httpbin.org/get'))

    for response in responses:
        r = yield response
        print(r)

if __name__ == '__main__':
    react(main)

asyncio and aiohttp might be even more noteworthy, but I guess it would be easier to learn an asynchronous version of an already familiar module.
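
To make the asyncio idea concrete, here is a minimal Python 3 sketch of the same fan-out pattern. It deliberately uses asyncio.sleep as a stand-in for the HTTP call so it runs without a network connection; with aiohttp you would await the response inside a session instead.

```python
import asyncio

async def fetch(url):
    # asyncio.sleep stands in for the HTTP call so the sketch runs
    # offline; with aiohttp this would be roughly:
    #   async with session.get(url) as resp: return await resp.read()
    await asyncio.sleep(0.1)
    return 'fetched: ' + url

async def main():
    urls = ['http://httpbin.org/get'] * 10
    # gather schedules all ten coroutines concurrently on one thread
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 10 results in roughly the time of one call
```

All ten "requests" overlap their waiting, which is exactly why a single async thread can beat a serial loop for I/O-bound work.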

Multithreading is unnecessary here, but you can try multithreading or, perhaps even better, multiprocessing, and see which performs best.
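
If you do want to try threads without writing boilerplate, the standard library's multiprocessing.dummy gives a Pool backed by threads, which suits I/O-bound work. In this sketch, time.sleep is a stand-in for requests.get so it runs offline:

```python
from multiprocessing.dummy import Pool  # Pool API, backed by threads
import time

def fetch(url):
    # Stand-in for requests.get(url): simulate 0.1 s of network
    # latency so the sketch runs without a network connection.
    time.sleep(0.1)
    return 'fetched: ' + url

urls = ['http://httpbin.org/get'] * 5

pool = Pool(5)                    # five worker threads
results = pool.map(fetch, urls)   # blocks until all five are done
pool.close()
pool.join()
print(results)
```

With five workers the five sleeps overlap, so the whole map takes roughly one sleep's worth of time instead of five.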

Serge
  • @SilverSlash async could be multithreaded or not, the point being that async is a way to make multiple IO calls without waiting for their results. So in fact you can get much faster results using AsyncIO than with multithreading in some cases, as the code can do other stuff while waiting for a request to finish. – Loïc Faure-Lacroix Oct 31 '18 at 16:23
  • there is a cost to create an object and a thread, while threads do not bring any extra computational power (they just allow a processor to be utilized more efficiently), so yes, async is likely to outperform multithreading. Multiprocessing might allow the use of extra cores – Serge Oct 31 '18 at 16:43
  • @LoïcFaure-Lacroix True, I was too fixated on my task when I asked the question so I got a little confused. For my task, I need to download images in bulk AND resize and save them to disk. So I believe multithreading is better suited for me. – Silver Nov 01 '18 at 05:11
  • then multi processing might work for you even better – Serge Nov 01 '18 at 14:26
  • @Serge Multiple threads could also take advantage of multiple cores, so there is no need for `multiprocessing` for that reason. Also, since one thread is mainly waiting for data to be downloaded anyway (and IO releases the GIL), it is likely threads can perform as good as `multiprocessing`. However, if the processing takes more resources and needs to be split on more threads/cores, then the GIL will impose restrictions and it might be better to move to `multiprocessing`. – JohanL Nov 03 '18 at 15:04
  • @JohanL No, you are wrong. Standard CPython threads are limited to one core, due to the GIL https://stackoverflow.com/questions/7542957/is-python-capable-of-running-on-multiple-cores . Also the topic starter is doing image resizing – Serge Nov 03 '18 at 15:46
  • @Serge Please re-read my comment and the answer to the question you are referring to again. I/O releases the GIL, as well as a number of modules implemented in C (most prominently `numpy` and friends). In these situations, threads are not bound by the GIL. And downloading is I/O, so if one thread is downloading, the other is free to run on another core. The same is true, if one thread is doing heavy, long-running `numpy` operations. – JohanL Nov 03 '18 at 15:56
  • Yup I believe you got it wrong. Any references (for IO operations allowing multicore)? Let's not involve numpy; obviously heavy computing libraries have their way around the GIL. – Serge Nov 03 '18 at 16:25
  • Actually the official doc says nothing about cores; perhaps it allows IO operations to run on another core. Perhaps you are right. The official doc just says that to access any object a thread should hold the lock; perhaps if no object is accessed, threads can interleave or execute on several cores. – Serge Nov 03 '18 at 17:19
  • Still, the GIL can degrade performance even when it is not a bottleneck. The system call overhead is significant, especially on multicore hardware. Two threads calling a function may take twice as much time as a single thread calling the function twice. The GIL can cause I/O-bound threads to be scheduled ahead of CPU-bound threads. And it prevents signals from being delivered. (from the Python wiki) – Serge Nov 03 '18 at 17:25
  • @Serge Yes, there are definitely a cost connected to multithreading and the GIL even when it is not totally blocking. But, multiprocessing is also not free. What is best needs to be evaluated case-by-case, but even so, multithreading *can* very well be useful also to take advantage of multiple cores, in *some* situations and this, with one thread being repsonsible for I/O, can very well be one of those. – JohanL Nov 03 '18 at 21:32
  • @JohanL what you're looking for is probably a mix of asyncio and multi processing. You can have all of the tasks run in one thread in one process to download asynchronuously all the files. Then each time a file is downloaded start a new process that will resize/transform the picture. To your main process this will all be considered as IO operation so you can wait for the result by either writing the picture to stdout or simply returning 0 with a defined output file. Multithread will gain you nothing. – Loïc Faure-Lacroix Nov 06 '18 at 03:32
  • Unless you want to share data between threads, there is not much gain over just spawning new processes, even more so if the time to start the process takes less time than the job it has to do. It's all relative to the task you're trying to achieve, but in theory you really don't need threads here, because your threads don't share data. So you can have a process that downloads and spawns processes to resize. – Loïc Faure-Lacroix Nov 06 '18 at 03:38
  • @LoïcFaure-Lacroix This is not really the place for this discussion, but obviously there will be a need for the threads to pass data if one thread/process is downloading and another one is re-scaling the downloaded image. Anyway, I do not doubt that there are ways to do this asynchronously - that is not what I have been arguing. I just wanted to make clear that multithreading is a viable option for a speed increase, even with the GIL. I have even written a short test program that, for my particular use case, reduces the run time by one third using threads (compared to NOT using threads). – JohanL Nov 07 '18 at 05:41
  • @Serge See my comment above; in my particular test case, downloading and resizing a certain image, it is possible to reduce the run time one third using threading (compared not to use threading). Also, looking at CPU usage for the function it exceeds 100 %, meaning that it employs multiple cores. – JohanL Nov 07 '18 at 05:43
  • For this function - you mean by the python interpreter process? What libraries do you use? – Serge Nov 07 '18 at 11:15
  • Running two threads simultaneously seems to contradict the docs, which say that only the one Python thread owning the GIL can execute. But we cannot completely exclude some error in the code, the docs, or the common interpretation of the Python documentation. – Serge Nov 07 '18 at 15:30

You can do multiple parallel requests like so using multithreading:

import Queue
import threading
import time
import requests

exit_flag = 0

class RequestThread(threading.Thread):
    def __init__(self, thread_id, name, q):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.name = name
        self.q = q
    def run(self):
        print("Starting {0:s}".format(self.name))
        process_data(self.name, self.q)
        print("Exiting {0:s}".format(self.name))

def process_data(thread_name, q):
    while not exit_flag:
        queue_lock.acquire()
        if not q.empty():
            data = q.get()
            queue_lock.release()
            print("{0:s} processing {1:s}".format(thread_name, data))
            response = requests.get(data)
            print(response)
        else:
            queue_lock.release()
        time.sleep(1)

thread_list = ["Thread-1", "Thread-2", "Thread-3"]
request_list = [
    "https://api.github.com/events",
    "http://api.plos.org/search?q=title:THREAD",
    "http://api.plos.org/search?q=title:DNA",
    "http://api.plos.org/search?q=title:PYTHON",
    "http://api.plos.org/search?q=title:JAVA"
]
queue_lock = threading.Lock()
work_queue = Queue.Queue(10)
threads = []
thread_id = 1

# Create new threads
for t_name in thread_list:
    thread = RequestThread(thread_id, t_name, work_queue)
    thread.start()
    threads.append(thread)
    thread_id += 1

# Fill the queue
queue_lock.acquire()
for word in request_list:
    work_queue.put(word)
queue_lock.release()

# Wait for queue to empty
while not work_queue.empty():
    pass

# Notify threads it's time to exit
exit_flag = 1

# Wait for all threads to complete
for t in threads:
    t.join()

print("Exiting Main Thread")

Output:

Starting Thread-1
Starting Thread-2
Starting Thread-3
Thread-1 processing https://api.github.com/events
Thread-2 processing http://api.plos.org/search?q=title:THREAD
Thread-3 processing http://api.plos.org/search?q=title:DNA
<Response [200]>
<Response [200]>
<Response [200]>
Thread-2 processing http://api.plos.org/search?q=title:PYTHON
Thread-3 processing http://api.plos.org/search?q=title:JAVA
Exiting Thread-1
<Response [200]>
<Response [200]>
Exiting Thread-3
Exiting Thread-2
Exiting Main Thread

A little explanation although I'm no multithreading expert:

1. Queue

The Queue module allows you to create a new queue object that can hold a specific number of items. The following methods control the Queue:

  • get() − removes and returns an item from the queue.
  • put() − adds an item to the queue.
  • qsize() − returns the number of items that are currently in the queue.
  • empty() − returns True if the queue is empty; otherwise, False.
  • full() − returns True if the queue is full; otherwise, False.

In my limited experience with multithreading, this is useful for keeping track of what data still has to be processed. I had situations where threads were doing the same thing, or all exited except one. This helped me control the shared data to be processed.
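
The methods above can be tried in isolation; this small sketch just exercises a Queue on the main thread (no worker threads involved):

```python
try:
    import queue            # Python 3 module name
except ImportError:
    import Queue as queue   # Python 2 module name

q = queue.Queue(3)  # holds at most 3 items
q.put('a')
q.put('b')
print(q.qsize())   # 2
print(q.empty())   # False
print(q.full())    # False
print(q.get())     # 'a' -- items come out in FIFO order
```

Note that Queue itself is thread-safe: get() and put() do their own internal locking, which is why multiple workers can pull from one queue.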

2. Lock

The threading module provided with Python includes a simple-to-implement locking mechanism that allows you to synchronize threads. A new lock is created by calling the Lock() method, which returns the new lock.

A primitive lock is in one of two states, “locked” or “unlocked”. It is created in the unlocked state. It has two basic methods, acquire() and release(). When the state is unlocked, acquire() changes the state to locked and returns immediately. When the state is locked, acquire() blocks until a call to release() in another thread changes it to unlocked, then the acquire() call resets it to locked and returns. The release() method should only be called in the locked state; it changes the state to unlocked and returns immediately. If an attempt is made to release an unlocked lock, a ThreadError will be raised.

In more human language: locks are the most fundamental synchronization mechanism provided by the threading module. At any time, a lock can be held by a single thread, or by no thread at all. If a thread attempts to hold a lock that’s already held by some other thread, execution of the first thread is halted until the lock is released.

Locks are typically used to synchronize access to a shared resource. For each shared resource, create a Lock object. When you need to access the resource, call acquire to hold the lock (this will wait for the lock to be released, if necessary), and call release to release it.
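
A classic way to see this in action is two threads incrementing one shared counter; the lock serializes the increments so no update is lost:

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # acquire() on entry, release() on exit
            counter += 1

threads = [threading.Thread(target=add_many, args=(100000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 200000 every time, because increments are serialized
```

The `with lock:` form is equivalent to calling acquire() and release() by hand, but it also releases the lock if the body raises an exception.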

3. Thread

To implement a new thread using the threading module, you have to do the following:

  • Define a new subclass of the Thread class.
  • Override the __init__(self [, args]) method to add additional arguments.
  • Then, override the run(self [,args]) method to implement what the thread should do when started.

Once you have created the new Thread subclass, you can create an instance of it and then start a new thread by invoking start(), which in turn calls the run() method. Methods:

  • run() − method is the entry point for a thread.
  • start() − method starts a thread by calling the run method.
  • join([time]) − waits for threads to terminate.
  • isAlive() − method checks whether a thread is still executing.
  • getName() − returns the name of a thread.
  • setName() − sets the name of a thread.
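
Stripped of the queue machinery, the subclassing recipe above boils down to this minimal sketch (the Worker name and result attribute are just for illustration):

```python
import threading

class Worker(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self)  # always call the base __init__
        self.name = name
        self.result = None

    def run(self):
        # run() is the thread's entry point, invoked for us by start()
        self.result = 'done by %s' % self.name

w = Worker('Thread-1')
w.start()        # spawns the OS thread, which calls run()
w.join()         # block until the thread terminates
print(w.result)  # done by Thread-1
```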

Is it really faster?

Using single thread:

$ time python single.py 
Processing request url: https://api.github.com/events
<Response [200]>
Processing request url: http://api.plos.org/search?q=title:THREAD
<Response [200]>
Processing request url: http://api.plos.org/search?q=title:DNA
<Response [200]>
Processing request url: http://api.plos.org/search?q=title:PYTHON
<Response [200]>
Processing request url: http://api.plos.org/search?q=title:JAVA
<Response [200]>
Exiting Main Thread

real    0m22.310s
user    0m0.096s
sys 0m0.022s

Using 3 threads:

Starting Thread-1
Starting Thread-2
Starting Thread-3
Thread-3 processing https://api.github.com/events
Thread-1 processing http://api.plos.org/search?q=title:THREAD
Thread-2 processing http://api.plos.org/search?q=title:DNA
<Response [200]>
<Response [200]>
<Response [200]>
Thread-1 processing http://api.plos.org/search?q=title:PYTHON
Thread-2 processing http://api.plos.org/search?q=title:JAVA
Exiting Thread-3
<Response [200]>
<Response [200]>
Exiting Thread-1
 Exiting Thread-2
Exiting Main Thread

real    0m11.726s
user    0m6.692s
sys 0m0.028s

Using 5 threads:

time python multi.py 
Starting Thread-1
Starting Thread-2
Starting Thread-3
 Starting Thread-4
Starting Thread-5
Thread-5 processing https://api.github.com/events
Thread-1 processing http://api.plos.org/search?q=title:THREAD
Thread-2 processing http://api.plos.org/search?q=title:DNA
Thread-3 processing http://api.plos.org/search?q=title:PYTHONThread-4 processing http://api.plos.org/search?q=title:JAVA

<Response [200]>
<Response [200]>
 <Response [200]>
<Response [200]>
<Response [200]>
Exiting Thread-5
Exiting Thread-4
Exiting Thread-2
Exiting Thread-3
Exiting Thread-1
Exiting Main Thread

real    0m6.446s
user    0m1.104s
sys 0m0.029s

Almost 4 times faster with 5 threads. And those were only 5 dummy requests. Imagine with a bigger chunk of data.

Please note: I've only tested this under Python 2.7. For Python 3.x, minor adjustments are probably needed (for instance, the Queue module was renamed to queue).
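
On Python 3, the same fan-out pattern is much shorter with the standard concurrent.futures module. In this sketch, time.sleep is a stand-in for requests.get so it runs offline:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch(url):
    # Stand-in for requests.get(url); simulates 0.1 s of latency.
    time.sleep(0.1)
    return 'fetched: ' + url

urls = ['http://api.plos.org/search?q=title:DNA'] * 5

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch, u) for u in urls]
    # as_completed yields each future as soon as its thread finishes
    results = [f.result() for f in as_completed(futures)]

print(len(results))  # 5
```

The executor replaces the hand-rolled queue, lock, exit flag, and join loop from the answer above, which is why it is the usual choice on Python 3.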

Raoslaw Szamszur
  • That's not bad, but you should probably look at asyncio. A single thread is only slow because you're waiting for IO, but it could be fast if the requests were launched asynchronously and the results were fetched in an event loop. – Loïc Faure-Lacroix Oct 31 '18 at 16:21
  • @LoïcFaure-Lacroix There is always more than just one way to do something :). I will look into asyncio as well, thanks for the advice. I wonder, though, which would be faster: multithreading or asynchronous requests. – Raoslaw Szamszur Oct 31 '18 at 16:37