38

I just had this thought and was wondering whether it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (say a Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps).

I've come across a paper where this was done, but I cannot recall its title. It was about crawling the entire web on a single dedicated server using some statistical model.

Anyway, imagine starting with just around 10,000 seed URLs and doing an exhaustive crawl.

Is it possible?

I need to crawl the web but am limited to a dedicated server. How can I do this? Is there an open-source solution out there already?

For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing this?

bohohasdhfasdf
  • 693
  • 2
  • 11
  • 16
  • 29
    I wish you good luck in your journey. – meder omuraliev Jan 17 '10 at 08:15
  • 1
    Curious HOW LONG it would take to crawl even 50% of the web from a single machine (even on a fat pipe, real-deal cores, lots of RAM and HDD space). How long? Any projections? – mvbl fst Jun 03 '10 at 16:55
  • 4
    Google crawls 4 billion pages per day and still they aren't able to crawl the whole web. – Munish Goyal Jan 06 '11 at 19:29
  • 1
    Average page size = 30 kB. Your 100 Mbps will give you 40 million a day, and that's theoretical. And yes, your CPU won't be able to catch up parsing them. – Munish Goyal Jan 06 '11 at 19:32
  • Average page size based on my crawl over 250 million pages is about 70kB as of 2014. – Lothar May 02 '16 at 17:32
  • Google's index was just 5 billion pages in 2012 (now about 10 billion) when they released Gumbo, their HTML parser. But this counts only content pages; other pages are immediately discarded by the crawler. DuckDuckGo and Blekko both have 3-4 billion quality pages. – Lothar May 02 '16 at 17:34

8 Answers

23

Crawling the Web is conceptually simple. Treat the Web as a very complicated directed graph. Each page is a node. Each link is a directed edge.

You could start with the assumption that a single well-chosen starting point will eventually lead to every other point. This won't be strictly true, but in practice I think you'll find it's mostly true. Still, chances are you'll need multiple (maybe thousands of) starting points.

You will want to make sure you don't traverse the same page twice (within a single traversal). In practice the traversal will take so long that it's really a question of how long it will be before you come back to a particular node, and also how you detect and deal with changes (meaning the second time you come to a page it may have changed).
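Purely as an illustration of that traversal, here is a minimal breadth-first sketch in Python with a visited set. It assumes the third-party requests and beautifulsoup4 packages; a real crawler would add robots.txt checks, per-host delays, retries and persistent storage on top of this.

```python
import collections
import urllib.parse

import requests                # pip install requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def crawl(seed_urls, max_pages=1000):
    """Breadth-first traversal of the web graph, skipping pages already seen."""
    visited = set()
    frontier = collections.deque(seed_urls)

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # unreachable or misbehaving server: skip this node

        # Each link is a directed edge; add unseen targets to the frontier.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urllib.parse.urljoin(url, anchor["href"])
            link, _ = urllib.parse.urldefrag(link)  # drop #fragments
            if link.startswith("http") and link not in visited:
                frontier.append(link)

        yield url, response.text  # hand the page to whatever stores or parses it
```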

The killer will be how much data you need to store and what you want to do with it once you've got it.

cletus
  • 616,129
  • 168
  • 910
  • 942
16

Sorry to revive this thread after so long, but I just wanted to point out that if you are just in need of an extremely large web dataset, there is a much easier way to get it than to attempt crawling the entire web yourself with a single server: just download the free crawl database provided by the Common Crawl project. In their words:

We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.

As of today their database is petabytes in size, and contains billions of pages (trillions of links). Just download it, and perform whatever analysis you're interested in there.
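For what it's worth, here is a minimal sketch of pulling pages out of a single downloaded Common Crawl WARC file using the third-party warcio package. The file name is just a placeholder, not a real Common Crawl path.

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Placeholder: any WARC file downloaded from Common Crawl will do here.
warc_path = "example-commoncrawl-segment.warc.gz"

with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the actual HTTP responses (the pages).
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            print(url, len(html), "bytes")
```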

J. Taylor
  • 4,567
  • 3
  • 35
  • 55
  • Yes, it's on Amazon EC2, and that makes it absolutely worthless given the insane prices for processing the Common Crawl. It is much cheaper to do it yourself. Also, it's old, not deduplicated, and contains a giant mix of all possible data. – Lothar May 02 '16 at 17:45
  • @Lothar if you're processing, say, URLs only, I think someone has compiled it. Do you still do crawling today? – CodeGuru Apr 06 '19 at 07:13
  • No, this project ended in 2017. Are you trying to do something like a backlink service where only the URL is required? That was one of the side-effect businesses we wanted to build. – Lothar Apr 06 '19 at 15:45
8

I believe the paper you're referring to is "IRLbot: Scaling to 6 Billion Pages and Beyond". This was a single server web crawler written by students at Texas A&M.

Leaving aside issues of bandwidth, disk space, crawling strategies, robots.txt/politeness - the main question I've got is "why?" Crawling the entire web means you're using shared resources from many millions of web servers. Currently most webmasters allow bots to crawl them, provided they play nice and obey implicit and explicit rules for polite crawling.

But each high-volume bot that hammers a site without obvious benefit results in a few more sites shutting the door to everything besides the big boys (Google, Yahoo, Bing, etc). So you really want to ask the why question before spending too much time on the how.
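If you do go ahead, honoring robots.txt is the bare minimum of politeness. A small sketch using Python's standard urllib.robotparser (the user-agent string is a made-up example):

```python
import urllib.parse
import urllib.robotparser

def allowed_to_fetch(url, user_agent="MyResearchCrawler/0.1"):
    """Return True if the site's robots.txt permits this user-agent to fetch url."""
    parts = urllib.parse.urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses the site's robots.txt
    return parser.can_fetch(user_agent, url)
```

In practice you would cache one parser per host and throttle requests to the same host, rather than refetching robots.txt for every URL.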

Assuming you really do need to crawl a large portion of the web on a single server, then you'd need to get a fatter pipe, lots more storage space (e.g. assume 2K compressed text per page, so 2TB for 1B pages), lots more RAM, at least 4 real cores, etc. The IRLbot paper would be your best guide. You might also want to look at the crawler-commons project for reusable chunks of Java code.
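A quick back-of-envelope check of those numbers; the 2 KB stored per page is the assumption from this answer, and the 30 kB average transfer size comes from the comments above rather than any measurement:

```python
# Rough capacity planning for a 1B-page crawl.
pages = 1_000_000_000               # 1 billion pages
stored_bytes_per_page = 2_000       # ~2 KB of compressed text kept per page
fetched_bytes_per_page = 30_000     # ~30 kB transferred per page before compression
link_bits_per_second = 100_000_000  # a 100 Mbps pipe

storage_tb = pages * stored_bytes_per_page / 1e12
transfer_days = pages * fetched_bytes_per_page * 8 / link_bits_per_second / 86_400

print(f"Storage needed: ~{storage_tb:.0f} TB")                      # ~2 TB
print(f"Fetch time at full line rate: ~{transfer_days:.0f} days")   # ~28 days, ignoring everything else
```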

And a final word of caution. It's easy for an innocent mistake to trigger problems for a web site, at which time you'll be on the receiving end of an angry webmaster flame. So make sure you've got thick skin :)

kkrugler
  • 8,145
  • 6
  • 24
  • 18
4

See this for an alternative solution, depending on what you'd be looking to do with that much data (even if it were possible): Metacrawlers and Metasearch Engines1

EDIT: Also, don't forget the web is changing all the time, so even relatively small crawling operations (like classifieds sites that aggregate listings from lots of sources) refresh their crawls on a cycle, say a 24-hour cycle. That's when website owners may or may not start being inconvenienced by the load your crawler puts on their servers. And then, depending on how you use the crawled content, you've got de-duping to think about, because you need to teach your systems to recognise whether the crawl results from yesterday are different from those of today, etc. It gets very "fuzzy", not to mention the computing power needed (a small change-detection sketch follows the footnote below).

1. searchenginewatch.com — archived: February 2010
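To make the de-duping point above concrete, here is a toy sketch that fingerprints crudely normalized page content so the next crawl cycle can tell whether a page has actually changed. Real systems use near-duplicate detection such as shingling or SimHash rather than exact hashes.

```python
import hashlib
import re

def content_fingerprint(html):
    """Hash a whitespace-normalized, lowercased copy of the page so trivial
    formatting changes don't register as new content."""
    normalized = re.sub(r"\s+", " ", html).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def changed_since_last_crawl(url, html, previous_fingerprints):
    """previous_fingerprints maps url -> fingerprint from the previous cycle
    (assumed to be loaded from whatever storage the crawler uses)."""
    return previous_fingerprints.get(url) != content_fingerprint(html)
```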

mchid
  • 2,699
  • 1
  • 14
  • 12
Tom
  • 30,090
  • 27
  • 90
  • 124
3

Use a Bloom filter for detecting where you have been.

There will be false positives, but you can mitigate this by implementing multiple Bloom filters, rotating which filter gets added to, and making each filter impressively long.

http://en.wikipedia.org/wiki/Bloom_filter
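A minimal pure-Python illustration of the idea; the sizing numbers are arbitrary examples, and a production crawler would use a tuned library or the rotating-filter scheme described above.

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership: false positives possible, false negatives never."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one SHA-256.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Sizing example: ~1 million URLs at ~1% false-positive rate needs roughly
# 9.6 bits per element and 7 hash functions.
seen = BloomFilter(num_bits=10_000_000, num_hashes=7)
seen.add("http://example.com/")
print("http://example.com/" in seen)       # True
print("http://example.com/other" in seen)  # almost certainly False
```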

arachnode.net
  • 791
  • 5
  • 12
2

I bet it is possible. You only need to have a quantum CPU and quantum RAM.

Seriously, a single server wouldn't be able to catch up with the growth of the entire web. Google uses a huge farm of servers (counted in tens, if not hundreds of thousands), and it can't provide you with immediate indexing.

I guess if you're limited to a single server and need to crawl the entire web, what you really need is the results of that crawl. Instead of focusing on "how to crawl the web", focus on "how to extract the data you need using Google". A good starting point for that would be: Google AJAX Search API.

Marcin Seredynski
  • 7,057
  • 3
  • 22
  • 29
  • It has been a long time since Google removed all legal ways to automate and reuse search results via its API. It is only possible illegally, Google never returns more than 400 results per query, and the ways to customize the search and results are very, very limited. – Lothar May 02 '16 at 17:42
0

Sounds possible, but the two real problems will be the network connection and hard drive space. Speaking as someone who knows almost nothing about web crawling, I'd start with several terabytes of storage and a good broadband internet connection, and work my way up as I amass more information. Deep pockets are a must for this!

RCIX
  • 38,647
  • 50
  • 150
  • 207
  • 1
    I doubt terabytes are the right units when we're talking about web crawling. Google processes about 20 petabytes of data every day. Read the abstract: http://portal.acm.org/citation.cfm?doid=1327452.1327492 – Marcin Seredynski Jan 17 '10 at 08:24
  • 1
    True, but I seriously doubt someone could pump petabytes through even a broadband connection... – RCIX Jan 17 '10 at 09:12
  • Petabytes means search queries and more, not just pages. – Ali Gajani Oct 17 '14 at 13:02
  • For a search engine you can get along with a single 10 Gbit machine, but you have to break it into parts and distribute them across physical continents. Google search is not as big as you think it is. Remember, DuckDuckGo was created by a single stay-at-home dad out of his basement, and it's still doing well with only 4 billion pages. – Lothar May 02 '16 at 17:48
0

I would just point out that the whole Internet is surely larger than 750 GB. Moreover, the data structures designed to index the web also take a lot of storage.

xiao 啸
  • 6,350
  • 9
  • 40
  • 51
  • If you store your index in a good way, you will be able to stuff a LOT of information onto your 750 GB hard disk. No one says that the crawler should store all data from every single web page it comes across. For instance, it could check whether it's a social site (myface, spacebook, tweeter, lurkedin, a forum or other pages of no interest). If, however, it's a page containing source code, it could mark it with a single bit and store the extracted info in a hashref'ed file (for starters). –  Dec 01 '13 at 06:41