When I tried it, I got 403 Forbidden errors because the IMDb server thinks I am a robot. Using Loop with Crawl Web is bad practice because the Loop operator does not implement any waiting between requests.

This process can be reduced to just the Crawl Web operator. The key parameters are:
- URL - set this to http://www.imdb.com/title/tt0454876
- max pages - set this to 79 or whatever number you need
- max page size - set this to 1000
- crawling rules - set these to the rules you specified
- output dir - choose a folder to store the retrieved pages in
This works because the Crawl Web operator works out all the URLs that match the follow rules and stores the pages whose URLs also match the store rules. Each visit is delayed by 1000 ms (the delay parameter) to avoid triggering a robot exclusion at the server.
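For anyone who wants to see the idea outside RapidMiner, here is a minimal Python sketch of the same follow/store/delay logic. The link graph and URL patterns are illustrative placeholders, not real IMDb pages, and this is not the operator's actual implementation:

```python
import re
import time

# Toy link graph standing in for the web (illustrative URLs only).
PAGES = {
    "http://example.com/title/tt0454876": [
        "http://example.com/title/tt0454876/reviews",
        "http://example.com/name/nm0000158",
    ],
    "http://example.com/title/tt0454876/reviews": [],
    "http://example.com/name/nm0000158": [],
}

def crawl(start, follow_rule, store_rule, max_pages=79, delay_ms=1000):
    """Breadth-first crawl: follow links matching follow_rule,
    store URLs matching store_rule, pause delay_ms between visits."""
    queue, seen, stored = [start], set(), []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        if re.search(store_rule, url):
            stored.append(url)
        for link in PAGES.get(url, []):
            if re.search(follow_rule, link):
                queue.append(link)
        # Politeness delay between visits, mirroring the delay parameter.
        time.sleep(delay_ms / 1000)
    return stored

print(crawl("http://example.com/title/tt0454876",
            follow_rule=r"/title/",
            store_rule=r"/title/.*reviews",
            delay_ms=10))
```

Only the reviews page is stored here, because it is the only visited URL that matches the store rule; the name page is never visited because its URL fails the follow rule.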
Hope this gets you started.