
I want to create a program that can fetch hundreds of webpages and return their content. At the moment I can do this with a simple Python script:

import requests

urls = [...]
data = []
for url in urls:
    content = requests.get(url).content
    data.append(content)

However, the downside of the above implementation is that, inside the for loop, the content of the current URL must finish loading before the request for the next URL can be made. What I want is to avoid this: I want to make one request per URL without waiting for the current URL's content to finish loading. How can I do this? I have read up on aiohttp and threading, but I am not sure which is the best approach.

Kyle DeGennaro
  • The best approach depends largely on what exactly you need to do. If you are fetching a few hundred pages with low latency, threads are fine. If you are fetching on the order of 1,000,000 pages with arbitrary latency, an async library may be beneficial. – MisterMiyagi Aug 06 '19 at 19:44
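
As a minimal sketch of the thread-based route mentioned in the comment above, one option is requests combined with concurrent.futures.ThreadPoolExecutor; the max_workers value here is an arbitrary illustrative choice, not something from the question:

import requests
from concurrent.futures import ThreadPoolExecutor

urls = [...]

def fetch(url):
    # each call runs in its own worker thread, so the downloads overlap
    return requests.get(url).content

# max_workers=20 is an arbitrary value for illustration; tune it to your workload
with ThreadPoolExecutor(max_workers=20) as executor:
    data = list(executor.map(fetch, urls))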

1 Answer


asyncio + aiohttp is a good combination that will provide a significant performance improvement.

Sample implementation:

import asyncio
import aiohttp


async def fetch(url):
    # open a session, request the page, and return the response body as text
    async with aiohttp.ClientSession() as session:
        resp = await session.get(url)
        content = await resp.text()
        return content


async def main():
    urls = [...]
    # schedule all fetches concurrently and wait until every one has finished
    webpages = await asyncio.gather(*[fetch(url) for url in urls])
    # use webpages for further processing


# create an event loop, run main() to completion, then close the loop
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
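
One possible refinement, not part of the original answer: fetch above opens a new ClientSession for every URL, whereas the aiohttp documentation recommends reusing a single session across requests. A sketch of that variant, assuming Python 3.7+ for asyncio.run, might look like this:

import asyncio
import aiohttp


async def fetch(session, url):
    # reuse the shared session; the requests still run concurrently
    async with session.get(url) as resp:
        return await resp.text()


async def main():
    urls = [...]
    async with aiohttp.ClientSession() as session:
        webpages = await asyncio.gather(*[fetch(session, url) for url in urls])
    # use webpages for further processing


asyncio.run(main())  # Python 3.7+ shorthand for creating and closing the loop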
RomanPerekhrest
  • Thank you for this! What does the `*` before `[fetch(url)]` do? – Kyle DeGennaro Aug 07 '19 at 11:10
  • @KyleDeGennaro It's called [argument unpacking](https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists). [`asyncio.gather`](https://docs.python.org/3/library/asyncio-task.html#asyncio.gather) accepts a variable number of awaitables as positional arguments. – Shiva Jan 18 '22 at 05:16
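
As a small, self-contained illustration of that unpacking behaviour (the helper below is purely hypothetical):

def show(*args):
    # *args collects any number of positional arguments into a tuple
    print(args)

nums = [1, 2, 3]
show(nums)   # one argument, the list itself -> ([1, 2, 3],)
show(*nums)  # three separate arguments      -> (1, 2, 3)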