I have a crawler that crawls a few different domains for new posts/content. The total amount of content is hundreds of thousands of pages, and a lot of new content is added each day, so the crawler needs to be running 24/7 to keep up.
Currently I host the crawler script on the same server as the site the crawler adds the content to, and I can only run it from a cronjob at night, because whenever the script runs, its load basically brings the website down. In other words: a pretty crappy solution.
So basically, what is my best option for this kind of setup?
Is it possible to keep running the crawler from the same host, but somehow balance the load so that the script doesn't kill the website?
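To make that concrete, by "balancing the load" I mean something like rate-limiting the requests so the crawler never saturates CPU or bandwidth. A minimal sketch of what I have in mind (the delay value and the `process()` stub are placeholders, not my real code):

```python
import time

import requests

REQUEST_DELAY = 2.0  # seconds between requests; placeholder value to keep load low

def process(html: str) -> None:
    """Placeholder for my real parsing/saving logic."""
    pass

def crawl(urls):
    session = requests.Session()
    for url in urls:
        try:
            resp = session.get(url, timeout=10)
            resp.raise_for_status()
            process(resp.text)
        except requests.RequestException as exc:
            print(f"failed to fetch {url}: {exc}")
        time.sleep(REQUEST_DELAY)  # throttle so the web server stays responsive
```

Would throttling like this realistically be enough, or does a crawl of this size always need its own machine?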
What kind of host/server should I be looking for to run a crawler? Does it need any specifications beyond a normal web host?
The crawler saves the images it finds. If I host the crawler on a secondary server, how do I save those images on my site's server? I assume I don't want to chmod 777 my uploads folder and let anyone put files on my server.
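One idea I had is a small authenticated upload endpoint on the main site, so only the crawler can write into the uploads folder. Roughly like this on the crawler side (the URL, API key, and endpoint are hypothetical, just to illustrate the idea):

```python
import requests

UPLOAD_URL = "https://example.com/api/upload"  # hypothetical endpoint on my site
API_KEY = "some-shared-secret"                 # would come from config, not hardcoded

def upload_image(path: str) -> None:
    # POST the file with a bearer token, so only the crawler can write
    with open(path, "rb") as fh:
        resp = requests.post(
            UPLOAD_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"image": fh},
            timeout=30,
        )
    resp.raise_for_status()
```

Is something like that a reasonable way to avoid the world-writable folder, or is there a more standard approach?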