I am creating a web crawler with a multiplexed download manager built on Linux epoll (Linux 2.6.30.x). I pick links from a database of over 40,000 domains (each with between 1 and 2,000 URLs), 250,000 URLs in total. I multiplex the downloads so that on average I have no more than 2 parallel streams per host (as per the HTTP spec recommendation), and so that I work through a batch of 10 to 50 hosts at a time. I chose non-blocking sockets and epoll for speed and scalability (I am low on RAM) and for ease of use compared with poll, select and signal-driven I/O.

I download the first few hundred URLs smoothly and rapidly. The trouble is that I keep getting an EAGAIN/EWOULDBLOCK error from certain links (sockets) that otherwise seem ready (i.e. I can open the links in my PC's browser at any point). Even after I check them with epoll repeatedly, expecting their status to change to EPOLLIN, they remain EAGAIN/EWOULDBLOCK. These links build up so quickly that I have to stop the whole download.

What does EAGAIN/EWOULDBLOCK really mean? Is it a permanent status, so that once it is detected I should remove that socket from any further observation?

Kindly help.

EdNdee
  • Can you clarify exactly what's happening? Are you getting an `epoll` read hit or write hit? What operation is returning `EAGAIN/EWOULDBLOCK`? – David Schwartz Apr 03 '12 at 08:48
  • I have 3 threads. thread1 issues epoll_ctl(epoll_writefd, EPOLL_CTL_ADD,..) and epoll_ctl(epoll_readfd, EPOLL_CTL_ADD,..) for each live host socket (fewer than 50 active). thread2 issues epoll_wait(epoll_writefd,...,-1) to check write readiness; when a socket is ready it sends the actual HTTP request, then issues epoll_ctl(epoll_writefd, EPOLL_CTL_DEL,..) to remove the socket from further write epoll. thread3 issues epoll_wait(epoll_readfd,..,-1) to check read readiness; when a socket is ready it downloads the page repeatedly (until error or complete), then issues epoll_ctl(epoll_readfd, EPOLL_CTL_DEL,..) to remove the socket from further read epoll. – EdNdee Apr 03 '12 at 10:28
  • Okay, so what operation returns `EWOULDBLOCK`? I think what you're missing is this: If a `read` operation returns `EWOULDBLOCK`, you don't want to try to read again until you get another `epoll` read hit. – David Schwartz Apr 03 '12 at 10:46
  • Solved! Thanks David. "If a read operation returns EWOULDBLOCK, you don't want to try to read again until you get another epoll read hit" - That's actually quite important because the thread would then block; I hadn't initially figured that out! I appreciate your help. – EdNdee Apr 04 '12 at 12:32
  • The thread shouldn't block because you should have set the socket non-blocking. (If you want to block, why use `epoll`? And if you don't want to block, you *must* set the socket.) What will happen, though, is that the thread will spin. – David Schwartz Apr 04 '12 at 19:21
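
The advice in the comments above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the names `set_nonblocking`, `drain_fd`, `wait_and_drain` and the `drain_result` codes are hypothetical, and the sketch assumes a level-triggered `EPOLLIN` registration. The idea is to read until `read` reports `EAGAIN`/`EWOULDBLOCK`, then go back to `epoll_wait` instead of retrying the read in a tight loop.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes used only by this sketch. */
enum drain_result { DRAIN_WAIT, DRAIN_EOF, DRAIN_ERROR };

/* Put a descriptor into non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Read everything currently available on a non-blocking descriptor.
 * The number of bytes consumed is added to *total. */
static enum drain_result drain_fd(int fd, size_t *total)
{
    char buf[4096];

    assert(total != NULL);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            *total += (size_t)n;   /* got data, keep reading */
            continue;
        }
        if (n == 0)
            return DRAIN_EOF;      /* peer closed the stream */
        if (errno == EINTR)
            continue;              /* interrupted by a signal, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return DRAIN_WAIT;     /* nothing more for now: wait for epoll */
        return DRAIN_ERROR;        /* a real error */
    }
}

/* Wait up to timeout_ms for one read event, then drain that descriptor.
 * DRAIN_WAIT means "still open, call again later". */
static enum drain_result wait_and_drain(int epfd, size_t *total, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    if (n < 0)
        return errno == EINTR ? DRAIN_WAIT : DRAIN_ERROR;
    if (n == 0)
        return DRAIN_WAIT;         /* timed out, nothing readable yet */
    return drain_fd(ev.data.fd, total);
}
```

A caller would keep the socket registered while `wait_and_drain` returns `DRAIN_WAIT` and remove it with `EPOLL_CTL_DEL` only on `DRAIN_EOF` or `DRAIN_ERROR`, so an `EAGAIN` never causes a socket to be dropped or spun on.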

1 Answer

The GNU C Library manual documents the meaning of each error code. EAGAIN/EWOULDBLOCK means "resource temporarily unavailable": the call might succeed if you try again later. A typical case is a non-blocking I/O operation that would otherwise have to block.
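
As a small illustration of that last case, the sketch below reads from an empty non-blocking pipe, which is one easy way to provoke the error. The function names `is_would_block` and `empty_read_would_block` are made up for this example, and a pipe stands in for a socket only to keep it self-contained.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* EAGAIN and EWOULDBLOCK are the same value on Linux, but POSIX
 * allows them to differ, so portable code checks both. */
static int is_would_block(int err)
{
    return err == EAGAIN || err == EWOULDBLOCK;
}

/* Returns 1 if reading an empty non-blocking pipe fails with
 * EAGAIN/EWOULDBLOCK, 0 if the read behaved differently,
 * and -1 if the pipe could not be set up. */
static int empty_read_would_block(void)
{
    int fds[2];
    char byte;
    int result = -1;

    if (pipe(fds) != 0)
        return -1;

    int flags = fcntl(fds[0], F_GETFL, 0);
    if (flags != -1 && fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) != -1) {
        /* Nothing has been written, so a blocking read would sleep here.
         * The non-blocking read fails immediately instead. */
        ssize_t n = read(fds[0], &byte, 1);
        int saved = errno;   /* save errno before any other call */
        result = (n == -1 && is_would_block(saved));
        if (result)
            printf("read: %s\n", strerror(saved));
    }

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

The error here says nothing about the health of the descriptor: once the writer sends data, the same `read` call succeeds.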

Khaled