0

I am trying to build a web crawler in Python using gRPC. I have included the functions for crawling in the server file, and I use the client to request a list of URLs from the user and send it to the server for the scraping part. Each URL takes about 25-30 seconds to scrape, so I want to use multiprocessing to speed up the process, i.e. extract information from N URLs using N cores in parallel. How do I proceed? Say I have 4 cores: is it possible to make 4 client calls to the server on 4 different cores? Or should I create a server-client pair separately on each core? Or can I create 4 server instances with different channel ports and run them on 4 cores?

I am new to all this, so I could use any kind of help.

Amruth Kiran
  • 31
  • 1
  • 8

1 Answer

0

The simple answer is that you can start four gRPC server processes using the same port. gRPC's TCP listener turns on the SO_REUSEPORT socket option by default, which means all traffic arriving at the shared port is load-balanced across the listening processes. This way, the gRPC servers can utilize all the computational power without the constraint of the GIL.
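A minimal sketch of that pattern, assuming Linux and the `grpcio` package (the servicer registration is left as a placeholder, since it depends on your generated `.proto` code):

```python
from concurrent import futures
import multiprocessing

import grpc  # pip install grpcio

_PORT = 50051


def _serve():
    # "grpc.so_reuseport" is on by default on Linux; it is set explicitly
    # here so every worker process can bind the same port.
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=1),
        options=(("grpc.so_reuseport", 1),),
    )
    # Register your generated servicer here, e.g.
    # crawler_pb2_grpc.add_CrawlerServicer_to_server(CrawlerServicer(), server)
    server.add_insecure_port(f"[::]:{_PORT}")
    server.start()
    server.wait_for_termination()


def main():
    # One server process per core; the kernel load-balances incoming
    # connections across all of them.
    procs = [
        multiprocessing.Process(target=_serve)
        for _ in range(multiprocessing.cpu_count())
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Note that each worker keeps a small thread pool of its own; the heavy lifting is done by having several independent processes, not by threads inside one process.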

If you prefer to manage the servers from a single parent process, I recommend using the multiprocessing library (doc).
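Alternatively, a single server process can fan the scraping work itself out to a pool of worker processes. A sketch of that idea, where `_scrape` is a hypothetical stand-in for the real 25-30 second scraping function:

```python
import multiprocessing


def _scrape(url):
    # Stand-in for the real scraping logic; in the question this step
    # takes about 25-30 seconds per URL.
    return (url, len(url))


def scrape_all(urls):
    # One worker process per core; Pool.map distributes the URLs across
    # the workers and blocks until every result is in, sidestepping the
    # GIL entirely.
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        return pool.map(_scrape, urls)
```

With this shape, the gRPC handler stays simple: it receives the URL list, calls `scrape_all`, and returns the results.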

As for the gRPC client, its load is not that high; you can use threading to achieve parallelism, or you can use multiprocessing to utilize more CPU cores.
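On the client side, a thread pool over a shared stub is usually enough, since the threads spend almost all their time waiting on the network. A sketch, where `stub` and its `Crawl` method are hypothetical placeholders for whatever your generated stub exposes:

```python
from concurrent import futures


def crawl_many(stub, urls, max_workers=4):
    # A single gRPC channel/stub is safe to share across threads in
    # Python; each thread issues one blocking RPC per URL.
    # `stub.Crawl(url)` is a placeholder for your generated RPC method.
    with futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(stub.Crawl, urls))
```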

Lidi Zheng
  • 1,801
  • 8
  • 13