I'm trying to make a simple web crawler using trio and asks. I use a nursery to start a couple of crawlers at once, and a memory channel to maintain the list of URLs to visit.
Each crawler receives clones of both ends of that channel, so it can grab a URL (via receive_channel), read it, find new URLs, and add them to be visited (via send_channel).
import math
import trio

async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
    async with trio.open_nursery() as nursery:
        async with send_channel, receive_channel:
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())

async def crawler(send_channel, receive_channel):
    async for url in receive_channel:      # I'm a consumer!
        content = await ...                # fetch the page (elided)
        urls_found = ...                   # parse out the links (elided)
        for u in urls_found:
            await send_channel.send(u)     # I'm a producer too!
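(As far as I understand trio's channel semantics, the async for above only finishes once every clone of send_channel has been closed, which is part of what makes this tricky: no single crawler can end the stream for the others. A tiny standalone demo of that behavior:)

import trio

async def demo():
    send, recv = trio.open_memory_channel(0)
    clone = send.clone()
    await send.aclose()      # the original is closed, but...
    # ...recv would still block here, because `clone` is open
    await clone.aclose()     # now ALL senders are closed
    async for item in recv:  # exits immediately, with no items
        print(item)

trio.run(demo)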
In this scenario the consumers are also the producers. How do I stop everything gracefully?
The conditions for shutting everything down are:
- the channel is empty,
- AND all crawlers are stuck at the async for loop, waiting for a URL to appear in receive_channel (which... won't happen anymore).
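The only way I can think of to detect both conditions at once is bookkeeping on top of the channel: count the URLs that are queued or still being processed, and when a crawler finishes the last one, nothing new can arrive, so it is safe to stop. A minimal sketch, assuming a hypothetical fetch_and_find_urls helper and a hard-coded seed URL (neither is in my real code):

import math
import trio

async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
    pending = 1                          # the seed URL below is outstanding work

    async def crawler(send_channel, receive_channel):
        nonlocal pending
        async with send_channel, receive_channel:
            async for url in receive_channel:
                urls_found = await fetch_and_find_urls(url)  # hypothetical helper
                for u in urls_found:
                    pending += 1         # count before sending, so it can't hit 0 early
                    await send_channel.send(u)
                pending -= 1             # this URL is now fully processed
                if pending == 0:
                    # Channel is empty AND every other crawler is parked in
                    # `async for`, so cancelling here is a clean shutdown.
                    nursery.cancel_scope.cancel()

    async with trio.open_nursery() as nursery:
        async with send_channel, receive_channel:
            await send_channel.send("https://example.com")   # hypothetical seed
            for _ in range(3):
                nursery.start_soon(crawler,
                                   send_channel.clone(), receive_channel.clone())

This works because trio is single-threaded (the counter is only touched between checkpoints), but it cancels the nursery instead of closing the channels, which doesn't feel like the graceful ending I was hoping for.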
I tried putting async with send_channel inside crawler(), but could not find a good way to make it work. I also looked for a different approach (some kind of memory-channel-bound worker pool, etc.), but no luck there either.