
I'm working on a web crawler that indexes sites that don't want to be indexed.

My first attempt: I wrote a C# crawler that goes through every page and downloads it. This resulted in my IP being blocked by their servers within 10 minutes.

I moved it to Amazon EC2 and wrote a distributed Python script that runs about 50 instances. This keeps me just under the threshold at which they boot me. It also costs about $1,900 a month...

I moved back to my initial idea and put it behind a shortened version of the Tor network. This worked, but was very slow.

I'm out of ideas. How can I get past them blocking me for repeated requests?

When I say "block", they are actually giving me a random 404 Not Found error on pages that definitely exist. It's random and only starts happening after I pass about 300 requests in an hour.

brandon

4 Answers


OK, first and foremost: if a website doesn't want you to crawl it too often, then you shouldn't! It's basic politeness, and you should always try to adhere to it.

However, I do understand that there are some websites, like Google, that make their money by crawling your website all day long, yet when you try to crawl Google, they block you.

Solution 1: Proxy Servers

In any case, the alternative to getting a bunch of EC2 machines is to get proxy servers. Proxy servers are MUCH cheaper than EC2; case in point: http://5socks.net/en_proxy_socks_tarifs.htm

Of course, proxy servers are not as fast as EC2 (bandwidth-wise), but you should be able to strike a balance where you get similar or higher throughput than your 50 EC2 instances for substantially less than what you're paying now. This involves searching for affordable proxies and finding ones that give you similar results. One thing to note: just like you, other people may be using the proxy service to crawl the website you're crawling, and they may not be as smart about how they crawl it, so the whole proxy service can get blocked because of some other client's activity (I've personally seen it).
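
To make the idea concrete, here is a minimal sketch of rotating requests through a proxy pool in Python (the proxy addresses, retry counts, and use of the requests library are illustrative assumptions, not tied to any particular provider):

import itertools
import time

import requests

# Placeholder proxy list -- replace with the proxies you actually buy.
# socks5:// URLs need the optional dependency: pip install requests[socks]
PROXIES = [
    "socks5://user:pass@proxy1.example.com:1080",
    "socks5://user:pass@proxy2.example.com:1080",
    "socks5://user:pass@proxy3.example.com:1080",
]

proxy_cycle = itertools.cycle(PROXIES)

def fetch(url, retries=3):
    """Try a URL through successive proxies until one returns a real page."""
    for _ in range(retries):
        proxy = next(proxy_cycle)
        try:
            resp = requests.get(url,
                                proxies={"http": proxy, "https": proxy},
                                timeout=20)
            # The target serves fake 404s when it blocks you, so treat a 404
            # as "rotate and retry", not as a genuinely missing page.
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            pass  # dead proxy or timeout: move on to the next one
        time.sleep(1)  # small pause so no single proxy gets hammered
    return None

page = fetch("http://example.com/some/page")

Rotating per request like this spreads the load across the whole pool, so no single IP ever crosses the roughly 300-requests-per-hour threshold you're hitting.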

Solution 2: You-Da-Proxy!

This is a little crazy and I haven't done the math behind it, but you could start a proxy service yourself and sell proxy access to others. You can't use all of your EC2 machines' bandwidth anyway, so the best way for you to cut costs is to do what Amazon does: sub-lease the hardware.

Kiril
  • The site I'm crawling is most likely not getting crawled by any other service. I was thinking proxy, but I'm always hesitant for the reason you mentioned. Now that you mention it though, nobody else will be crawling this site. Thanks! – brandon Dec 12 '11 at 15:35
  • You just solved my problem; I found some proxies on EC2. Since they're in the same zone as my servers, but with 100 different IPs, my life just got a lot cheaper with very limited sacrifice in bandwidth. Just did the math: my bill will be about $150 now! – brandon Dec 12 '11 at 15:38
  • Swing some cash my way then :), with $1,700 in savings ^^. Hehe, glad I could help tho. – Kiril Dec 12 '11 at 15:43

Using proxies is by far the most common way to tackle this problem. There are other, higher-level solutions that provide a sort of "page downloading as a service", guaranteeing that you get "clean" pages (not 404s, etc.). One of these is Crawlera (provided by my company), but there may be others.
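
Roughly speaking, services like this expose either a proxy-style endpoint or a fetch API. A rough Python sketch of the proxy-endpoint style follows (the host, port, and API-key auth below are placeholder assumptions; check the provider's documentation for the real values):

import requests

# Placeholder endpoint and credentials -- substitute your provider's values.
SMART_PROXY = "http://YOUR_API_KEY:@smart-proxy.example.com:8010"

resp = requests.get(
    "http://somesite.com/page",
    proxies={"http": SMART_PROXY, "https": SMART_PROXY},
    timeout=60,  # these services retry upstream, so responses can be slow
)
# HTTPS targets may additionally require the provider's CA certificate;
# again, see their documentation.
print(resp.status_code)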

Pablo Hoffman

For this case I usually use https://gimmeproxy.com, which checks each proxy every second.

To get a working proxy, you just need to make the following request:

https://gimmeproxy.com/api/getProxy

You will get a JSON response with all the proxy data, which you can then use as needed:

{
  "supportsHttps": true,
  "protocol": "socks5",
  "ip": "156.182.122.82:31915",
  "port": "31915",
  "get": true,
  "post": true,
  "cookies": true,
  "referer": true,
  "user-agent": true,
  "anonymityLevel": 1,
  "websites": {
    "example": true,
    "google": false,
    "amazon": true
  },
  "country": "BR",
  "tsChecked": 1517952910,
  "curl": "socks5://156.182.122.82:31915",
  "ipPort": "156.182.122.82:31915",
  "type": "socks5",
  "speed": 37.78,
  "otherProtocols": {}
}
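
As an illustration, wiring that response into a request from Python could look like this (the use of the requests library is my assumption; the field names come from the JSON above):

import requests

# Ask gimmeproxy for a working proxy.
proxy_info = requests.get("https://gimmeproxy.com/api/getProxy",
                          timeout=30).json()

# The "curl" field is already protocol://ip:port, e.g. "socks5://156.182.122.82:31915".
# socks5:// proxies need the optional dependency: pip install requests[socks]
proxy_url = proxy_info["curl"]

resp = requests.get("http://example.com/",
                    proxies={"http": proxy_url, "https": proxy_url},
                    timeout=30)
print(resp.status_code)
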
Andrey E

Whenever I have to get past the request limits of the pages I'm crawling, I usually do it with proxycrawl, as it's the fastest way to go. You don't have to worry about anything: infrastructure, IPs, getting blocked, etc.

They have a simple API which you can call as frequently as you want, and they will always return a valid response, skipping the limits.

https://api.proxycrawl.com?url=https://somesite.com

So far I've been using it for a few months and it works great. They even have a free plan.
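
For illustration, calling it from Python could look roughly like this (the token parameter and the requests library usage are my assumptions; check their docs for the exact parameters):

import requests

API_TOKEN = "YOUR_TOKEN"  # assumed: the API authenticates with an account token
TARGET_URL = "https://somesite.com"

resp = requests.get("https://api.proxycrawl.com/",
                    params={"token": API_TOKEN, "url": TARGET_URL},
                    timeout=60)

# Failed requests aren't billed (see the comment below), so a simple retry
# on non-200 responses is enough.
if resp.status_code == 200:
    html = resp.text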

ajimix
  • How many webpages are you crawling per second/minute? Have you faced any instances where the proxy server used by proxycrawl itself is blacklisted? – Sahil Gupta Dec 26 '17 at 17:30
  • I crawl many different sites. I don't know about their proxies and I don't really care; they don't charge me for failed requests, so if a request fails, I just call the same page again. They take care of the proxies and so on, not me. – ajimix Dec 29 '17 at 16:48