Is it possible? Yes. I have already built a solution on App Engine - wowprice.
Sharing every detail here would make my answer lengthy, so I'll outline the approach.
Problem - Suppose I want to crawl walmart.com. As I know, I can't crawl it in one shot (millions of products).
Solution - I designed my spider to break the work into smaller tasks.
- Step 1: I submit a job for walmart.com, and the job scheduler creates a task for it.
- Step 2: My spider picks up the task and notices it is the index page. The spider then creates more jobs, one per category page as a new starting point - say it enqueues 20 more tasks.
- Step 3: The spider creates still smaller jobs for the subcategories, and keeps drilling down until it reaches a product list page, then creates a task for that.
- Step 4: On each product list page, it extracts the products and makes a call to store the product data; if there is a next page, it enqueues one more task to crawl it.
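The fan-out in the steps above can be sketched in plain Python. This is a minimal simulation, not the App Engine code itself: the `SITE` dictionary is a made-up stand-in for walmart.com's page tree, the `deque` stands in for the task queue, and the `stored` list stands in for the datastore.

```python
from collections import deque

# Hypothetical page tree: an index fans out to categories, then
# subcategories, then paginated product lists. All names invented.
SITE = {
    "index": {"type": "index", "children": ["cat-a", "cat-b"]},
    "cat-a": {"type": "category", "children": ["sub-a1"]},
    "cat-b": {"type": "category", "children": ["sub-b1"]},
    "sub-a1": {"type": "subcategory", "children": ["list-a1-p1"]},
    "sub-b1": {"type": "subcategory", "children": ["list-b1-p1"]},
    "list-a1-p1": {"type": "product_list", "products": ["p1", "p2"],
                   "next": "list-a1-p2"},
    "list-a1-p2": {"type": "product_list", "products": ["p3"], "next": None},
    "list-b1-p1": {"type": "product_list", "products": ["p4"], "next": None},
}

def crawl(start):
    """Process one small task at a time; each task finishes quickly
    and enqueues follow-up tasks instead of crawling in one shot."""
    queue = deque([start])   # stands in for the task queue
    stored = []              # stands in for the datastore
    while queue:
        url = queue.popleft()
        page = SITE[url]
        if page["type"] in ("index", "category", "subcategory"):
            # Steps 1-3: fan out one task per child page.
            queue.extend(page["children"])
        else:
            # Step 4: store the products, then one task for the next page.
            stored.extend(page["products"])
            if page["next"]:
                queue.append(page["next"])
    return stored

print(crawl("index"))  # every product gets stored, one small task at a time
```

On App Engine the `queue.append(...)` calls would instead be task-queue inserts, so the tasks run in parallel on the backend rather than in a single loop.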
Advantages -
We can crawl without breaking the 30-second request limit, the crawling speed depends on how many backend machines we run, and we get parallel crawling of a single target.