Network optimizations while web crawling: using UDP and connection pooling?

Question

I'm looking at Donne Martin's design for a web crawler.

It suggests the following network optimizations:

The Crawler Service can improve performance and reduce memory usage by keeping many open connections at a time, referred to as connection pooling

Switching to UDP could also boost performance

I don't understand either suggestion. What does connection pooling have to do with web crawling? Isn't each crawler service opening its own connection to the host it's currently crawling? What good would connection pooling do here? As for UDP: isn't crawling a matter of issuing HTTP-over-TCP requests to web hosts? How is UDP relevant here?

You are correct, he is crazy. HTTP client libraries already do HTTP connection pooling for you, and HTTP over UDP does not exist in general. — user207421, May 16 '20 at 07:34

score 1 · Answer 1 · answered May 16 '20 at 04:30

What does connection pooling have to do with web crawling? Isn't each crawler service opening its own connection to the host it's currently crawling?

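I think you are assuming that the crawler sends a request to each host only once. That is not the case: a host may have hundreds of pages that you want to crawl, and opening a new connection for each one is inefficient, because every new connection pays for a TCP handshake (plus a TLS handshake for HTTPS) before any data moves. A pool keeps the connection to each host open and reuses it for the next page.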
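To make the reuse concrete, here is a minimal sketch using only Python's standard library. The class name ConnectionPool and its get method are made up for illustration; they are not from Donne Martin's design. The sketch assumes a single-threaded crawler and servers that honor HTTP/1.1 keep-alive, and it does not retry when a server drops an idle connection.

```python
import http.client
from urllib.parse import urlsplit


class ConnectionPool:
    """Hypothetical pool: one open connection per (scheme, host, port), reused across requests."""

    def __init__(self, timeout=10.0):
        self._timeout = timeout
        self._conns = {}

    def _connection(self, parts):
        key = (parts.scheme, parts.hostname, parts.port)
        conn = self._conns.get(key)
        if conn is None:
            # First request to this host: create the connection object once.
            if parts.scheme == "https":
                conn = http.client.HTTPSConnection(
                    parts.hostname, parts.port, timeout=self._timeout
                )
            else:
                conn = http.client.HTTPConnection(
                    parts.hostname, parts.port, timeout=self._timeout
                )
            self._conns[key] = conn
        return conn

    def get(self, url):
        """Fetch url over the pooled connection and return (status, body)."""
        parts = urlsplit(url)
        conn = self._connection(parts)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        # The body has to be read completely before the same
        # connection can carry the next request.
        body = response.read()
        return response.status, body

    def close(self):
        for conn in self._conns.values():
            conn.close()
        self._conns.clear()
```

Calling pool.get() for a hundred pages on the same host then pays for one handshake instead of a hundred. In practice you would rarely write this yourself: HTTP client libraries such as urllib3 or a requests Session already pool connections per host.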

As for UDP: isn't crawling a matter of issuing HTTP-over-TCP requests to web hosts? How is UDP relevant here?

Taken from the book Web Data Mining:

The crawler needs to resolve host names in URLs to IP addresses. The connections to the Domain Name System (DNS) servers for this purpose are one of the major bottlenecks of a naïve crawler, which opens a new TCP connection to the DNS server for each URL. To address this bottleneck, the crawler can take several steps. First, it can use UDP instead of TCP as the transport protocol for DNS requests. While UDP does not guarantee delivery of packets and a request can occasionally be dropped, this is rare. On the other hand, UDP incurs no connection overhead with a significant speed-up over TCP
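To show what a DNS request over UDP amounts to, here is a rough sketch in Python that builds an A-record query by hand, sends it as a single datagram, and resends it if no reply arrives. The function names (build_query, parse_a_records, resolve_udp) and the retry count are my own choices for illustration. It ignores truncated responses, CNAME chains, and IPv6, so treat it as a demonstration of the idea rather than a usable resolver.

```python
import secrets
import socket
import struct


def build_query(hostname, query_id):
    """Encode a DNS query asking for the A record of hostname."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def _skip_name(msg, offset):
    """Return the offset just past an encoded, possibly compressed, name."""
    while True:
        length = msg[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # 2-byte compression pointer ends the name
            return offset + 2
        offset += 1 + length


def parse_a_records(msg, query_id):
    """Return the IPv4 addresses found in the answer section of a response."""
    response_id, flags, qdcount, ancount = struct.unpack("!HHHH", msg[:8])
    if response_id != query_id:
        raise ValueError("response id does not match query id")
    if flags & 0x000F:
        raise ValueError(f"DNS server returned rcode {flags & 0x000F}")
    offset = 12
    for _ in range(qdcount):
        offset = _skip_name(msg, offset) + 4  # skip QTYPE and QCLASS
    addresses = []
    for _ in range(ancount):
        offset = _skip_name(msg, offset)
        rtype, rclass, _ttl, rdlength = struct.unpack("!HHIH", msg[offset:offset + 10])
        offset += 10
        if rtype == 1 and rclass == 1 and rdlength == 4:
            addresses.append(socket.inet_ntoa(msg[offset:offset + 4]))
        offset += rdlength
    return addresses


def resolve_udp(hostname, server, port=53, timeout=2.0, retries=2):
    """Resolve hostname by sending one UDP datagram to a DNS server."""
    query_id = secrets.randbits(16)
    query = build_query(hostname, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(retries + 1):
            # No handshake: the query itself is the first packet sent.
            sock.sendto(query, (server, port))
            try:
                response, _addr = sock.recvfrom(4096)
            except socket.timeout:
                continue  # request or reply was lost, so send it again
            return parse_a_records(response, query_id)
    raise TimeoutError(f"no DNS reply for {hostname!r}")
```

The point is the cost model: one datagram out and one back, with a resend covering the rare lost packet, versus a TCP handshake before every lookup. A real crawler would normally get this for free from the system resolver (for example socket.getaddrinfo) or a DNS library, usually with a cache in front.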

Any networking program that doesn't use the BSD Sockets API for DNS lookups is improperly written, and any library implementation of it that doesn't use UDP for DNS is ditto. — user207421, May 16 '20 at 07:33
