
I am experimenting and attempting to make a minimal web crawler. I understand the whole process at a very high level. So getting into the next layer of details, how does a program 'connect' to different websites to extract the HTML?

Am I using sockets to connect to servers and sending HTTP requests? Or am I issuing commands to the terminal to run telnet or ssh?

Also, is C++ a good language of choice for a web crawler?

Thanks!

CodeKingPlusPlus
  • You can do it with C++, of course, but I would suggest a scripting language would be far easier. I am a C++ coder, but I would never use it for this kind of application; I have done many using Perl. – mathematician1975 Jun 30 '12 at 16:44
  • Also, if you had searched this site you would have found this, which may help: http://stackoverflow.com/questions/4278024/a-very-simple-c-web-crawler-spider – mathematician1975 Jun 30 '12 at 16:47
  • @mathematician1975: +1 for Perl. And Lua can be a good alternative. – Jack Jun 30 '12 at 16:49
  • Python is very understandable, easy to start with and works very well with crawling. I suggest it. – orlp Jun 30 '12 at 16:50
  • Doing this in C++ requires a fair amount of boilerplate unless you can find a framework to use. You're going to have to learn the HTTP protocol, know the HTML standard, and study edge cases where you have to be permissive about the input you get from servers. – Radu Chivu Jun 30 '12 at 16:51

3 Answers


Also, is C++ a good language of choice for a web crawler?

Depends. How good are you at C++?
C++ is a good language for writing an advanced, high-speed crawler, because of its speed (and you need that to process the HTML pages). But it is not the easiest language to write a crawler in, so it is probably not a good choice if you are just experimenting.

Based on your question, you don't yet have the experience to write an advanced crawler, so you are probably looking to build a simple serial crawler. For that, speed is not a priority, as the bottleneck is downloading the page across the web (not processing it). So I would pick another language (maybe Python).

Martin York
  • I disagree. The answer is simply no. You are by no means CPU-bound either, unless you are doing extraordinary amounts of work for each page; downloading the pages takes a lot longer. And even in that case, the crawler should be written in a higher-level language and the parts that require performance can be ported to a language like C++. – orlp Jun 30 '12 at 16:49
  • @nightcracker: We are CPU bound because we are downloading 2000 HTML pages simultaneously (we have a serious crawler). http://devblog.seomoz.org/2012/06/how-does-seomoz-crawl-the-web/ We just can't afford the slowness of a scripting-language-based solution for maintaining the connections. (PS: we have tested it.) – Martin York Jun 30 '12 at 16:51
  • I would have never guessed. I assume you aren't using a one-thread-per-connection model (in which case you indeed would get CPU-bound very quickly)? Other than that, if a page download takes 0.5 second, divided by 2000, this means you have 0.25 ms per page to actually do the crawling. Are sockets this slow? – orlp Jun 30 '12 at 16:59
  • @nightcracker: No, we use a curl multi handle (a rough sketch of the multi interface follows these comments). The speed of a site depends on the site (google.com is fast, myMomsBreadShopOnTheHomePC.com not so much). You also have to take DNS resolution into account, so you need to build a non-default version of curl so you can do async DNS resolution for optimal speed. Not sure I agree with your simple maths. http://devblog.seomoz.org/2011/02/high-performance-libcurl-tips/ – Martin York Jun 30 '12 at 17:04
  • Oh well, I didn't think it would be that heavy. You probably know more about this than me, seeing the software you've built (and tested). On the other hand, 2000 pages per second is quite heavy-duty and most likely above the average crawling operation. I'll retract what I said earlier, I would have never thought you'd get CPU bound (not deleting my comment though, that would make the discussion un-followable by others). P.S.: +1 – orlp Jun 30 '12 at 17:47
  • @nightcracker: I agree; normally C++ is not a good language for a crawler (there are much easier languages to write this in). When bandwidth is your limit there is no point in using anything heavier than an appropriate scripting language. – Martin York Jun 30 '12 at 18:59
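
For reference, the multi-handle approach mentioned in the comments above lets a single thread drive many transfers at once. Below is a rough, untested sketch using libcurl's multi interface; the URLs are placeholders and the response bodies are simply discarded here:

    #include <curl/curl.h>
    #include <vector>

    // Discard the body; a real crawler would buffer it per handle for parsing.
    static size_t discard_cb(char*, size_t size, size_t nmemb, void*) {
        return size * nmemb;
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURLM* multi = curl_multi_init();

        // Placeholder URLs; a crawler would keep feeding new ones in.
        std::vector<const char*> urls = {"http://example.com/", "http://example.org/"};

        // One easy handle per URL, all attached to the same multi handle.
        for (const char* url : urls) {
            CURL* easy = curl_easy_init();
            curl_easy_setopt(easy, CURLOPT_URL, url);
            curl_easy_setopt(easy, CURLOPT_WRITEFUNCTION, discard_cb);
            curl_multi_add_handle(multi, easy);
        }

        // Drive all transfers concurrently from this single thread.
        int still_running = 0;
        do {
            curl_multi_perform(multi, &still_running);
            curl_multi_wait(multi, nullptr, 0, 1000, nullptr);  // wait for socket activity
        } while (still_running > 0);

        // Collect finished transfers and release their handles.
        int msgs_left = 0;
        while (CURLMsg* msg = curl_multi_info_read(multi, &msgs_left)) {
            if (msg->msg == CURLMSG_DONE) {
                curl_multi_remove_handle(multi, msg->easy_handle);
                curl_easy_cleanup(msg->easy_handle);
            }
        }

        curl_multi_cleanup(multi);
        curl_global_cleanup();
        return 0;
    }

For genuinely high throughput you would also build curl against an asynchronous DNS resolver (as the comment and the linked SEOmoz post describe), since blocking lookups quickly become the limiting factor.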

If you plan to stick with C++, then you should consider using the libcurl library, instead of implementing the HTTP protocol from scratch using sockets. There are C++ bindings available for that library.
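
For comparison, "from scratch using sockets" means roughly the following: resolve the host name, open a TCP connection to port 80, write an HTTP request by hand, and read back the response. A minimal POSIX-socket sketch (example.com is just a placeholder and most error handling is omitted):

    #include <cstdio>        // fwrite
    #include <string>
    #include <netdb.h>       // getaddrinfo
    #include <sys/socket.h>  // socket, connect, send, recv
    #include <unistd.h>      // close

    int main() {
        const char* host = "example.com";   // placeholder host

        // 1. Resolve the host name to an address (DNS lookup).
        addrinfo hints{}, *res = nullptr;
        hints.ai_family   = AF_UNSPEC;      // IPv4 or IPv6
        hints.ai_socktype = SOCK_STREAM;    // TCP
        if (getaddrinfo(host, "80", &hints, &res) != 0) return 1;

        // 2. Open a TCP socket and connect to port 80.
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) return 1;
        freeaddrinfo(res);

        // 3. Send a plain HTTP/1.1 GET request by hand.
        std::string request = "GET / HTTP/1.1\r\nHost: " + std::string(host) +
                              "\r\nConnection: close\r\n\r\n";
        send(fd, request.c_str(), request.size(), 0);

        // 4. Read the response (headers + HTML) until the server closes.
        char buf[4096];
        ssize_t n;
        while ((n = recv(fd, buf, sizeof buf, 0)) > 0)
            fwrite(buf, 1, n, stdout);

        close(fd);
        return 0;
    }

Doing this robustly (redirects, chunked transfer encoding, compression, HTTPS) is exactly the work libcurl saves you.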

From curl's webpage:

libcurl is a free and easy-to-use client-side URL transfer library, supporting DICT, FILE, FTP, FTPS, Gopher, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, Telnet and TFTP. libcurl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, Kerberos), file transfer resume, http proxy tunneling and more!

libcurl is highly portable, it builds and works identically on numerous platforms, including Solaris, NetBSD, FreeBSD, OpenBSD, Darwin, HPUX, IRIX, AIX, Tru64, Linux, UnixWare, HURD, Windows, Amiga, OS/2, BeOs, Mac OS X, Ultrix, QNX, OpenVMS, RISC OS, Novell NetWare, DOS and more...

libcurl is free, thread-safe, IPv6 compatible, feature rich, well supported, fast, thoroughly documented and is already used by many known, big and successful companies and numerous applications.
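
A minimal sketch of fetching a page with libcurl's easy interface (the URL is a placeholder; a real crawler would also set timeouts, check the HTTP status code, and respect robots.txt):

    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    // libcurl calls this for every chunk of body data it receives.
    static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp) {
        static_cast<std::string*>(userp)->append(data, size * nmemb);
        return size * nmemb;                 // tell curl we consumed everything
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* handle = curl_easy_init();
        std::string html;

        curl_easy_setopt(handle, CURLOPT_URL, "http://example.com/");  // placeholder URL
        curl_easy_setopt(handle, CURLOPT_FOLLOWLOCATION, 1L);          // follow redirects
        curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(handle, CURLOPT_WRITEDATA, &html);

        CURLcode res = curl_easy_perform(handle);                      // blocking fetch
        if (res == CURLE_OK)
            std::cout << html << '\n';                                 // the raw HTML
        else
            std::cerr << "download failed: " << curl_easy_strerror(res) << '\n';

        curl_easy_cleanup(handle);
        curl_global_cleanup();
        return 0;
    }

From here, crawling is a matter of extracting the links from the downloaded HTML and repeating the fetch for each new URL.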

Emile Cormier

Short answer: no. I prefer coding in C++, but this use case calls for a Java application. The Java API has HTML parsers plus built-in socket support. This project would be a pain in C++. I coded one once in Java and it was somewhat blissful.

BTW, there are many web crawlers out there but I assume you have custom needs :-)

madreblu