I'm working on a web crawler that indexes sites that don't want to be indexed.
My first attempt: I wrote a C# crawler that visits every page on the site and downloads it. My IP was blocked by their servers within 10 minutes.
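For reference, the logic was essentially this (a simplified Python sketch of what the C# version does; the seed URL is a placeholder):

```python
import urllib.parse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=1000):
    """Breadth-first crawl: fetch each page, save it, queue its same-site links."""
    seen = {seed_url}
    queue = [seed_url]
    while queue and len(seen) <= max_pages:
        url = queue.pop(0)
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            continue
        # Save the raw HTML for indexing (stdout here for brevity).
        print(f"fetched {url} ({len(resp.text)} bytes)")
        # Queue every same-site link we haven't seen yet.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urllib.parse.urljoin(url, a["href"])
            if (urllib.parse.urlparse(link).netloc
                    == urllib.parse.urlparse(seed_url).netloc
                    and link not in seen):
                seen.add(link)
                queue.append(link)

crawl("https://example.com/")  # hypothetical seed URL
```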
I moved it to Amazon EC2 and wrote a distributed Python script that runs across about 50 instances. This keeps each instance just under the threshold at which they boot me. It also costs about $1900 a month...
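Each instance runs essentially this worker loop (a sketch only; I'm assuming a shared Redis list as the work queue, and the host and key names are made up):

```python
import time

import redis
import requests

# Hypothetical shared queue; all ~50 instances point at the same Redis host.
queue = redis.Redis(host="queue.internal.example", port=6379)

def worker():
    """Pop URLs from the shared queue so each instance fetches a disjoint slice."""
    while True:
        url = queue.lpop("crawl:urls")
        if url is None:
            time.sleep(5)  # queue drained for now; poll again shortly
            continue
        resp = requests.get(url.decode(), timeout=10)
        print(f"{resp.status_code} {url.decode()}")
        time.sleep(1)  # keep each instance's request rate modest

worker()
```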
I went back to my initial single-crawler idea and routed it through a shortened version of the Tor network. This worked, but was very slow.
I'm out of ideas. How can I get around them blocking me for repeated requests?
When I say "block", they're actually returning a random 404 Not Found error on pages that definitely exist. It only starts happening after I pass about 300 requests in an hour.