
I am trying to crawl the reviews for a particular movie from the IMDB website. For this I am using Crawl Web, which I have embedded inside a Loop because there are 74 pages.

Attached are images of the configuration. Please help; I am badly stuck on this.

URL for Crawl Web is: http://www.imdb.com/title/tt0454876/reviews?start=%{pagePos}
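As an illustration of what the `%{pagePos}` macro expands to (a sketch in plain Python, not RapidMiner; the function name is my own), each loop iteration produces one page URL:

```python
# Illustrative sketch: the pagePos macro starts at 0 and grows by 10
# on each loop iteration, yielding one URL per page of reviews.
def review_urls(pages=74, step=10):
    base = "http://www.imdb.com/title/tt0454876/reviews?start={}"
    return [base.format(i * step) for i in range(pages)]
```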

[screenshot of the Loop and Crawl Web configuration]

Piyush Gupta
Kartik Solanki

1 Answer


When I tried it, I got 403 Forbidden errors because the IMDB server thinks I am a robot. Using Loop with Crawl Web is bad practice because the Loop operator does not wait between iterations.

This process can be reduced to just the Crawl Web operator. The key parameters are:

  • URL - set this to http://www.imdb.com/title/tt0454876
  • max pages - set this to 79 or whatever number you need
  • max page size - set this to 1000
  • crawling rules - set these to the ones you specified
  • output dir - choose a folder to store things in

This works because the Crawl Web operator follows every link that matches the crawling rules and stores the pages that also match them. Each visit is delayed by 1000 ms (the delay parameter) to avoid triggering the robot exclusion at the server.
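The effect of the delay parameter can be sketched in plain Python (a hedged illustration; `polite_crawl` and its arguments are my own naming, not part of RapidMiner or the operator's API):

```python
import time

def polite_crawl(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests the way the
    Crawl Web operator's delay parameter (1000 ms) does, so the server
    is less likely to treat the crawler as a robot and return 403."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay_seconds)  # throttle between visits
    return results
```

A tight Loop around Crawl Web skips this pause entirely, which is why the looped version trips the server's robot detection.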

Hope this gets you going as a start.

Andrew Chisholm
  • I have already initialized the macro with value 0 and am adding 10 in each iteration because the web pages for reviews are http://www.imdb.com/title/tt0454876/reviews?start=0, http://www.imdb.com/title/tt0454876/reviews?start=10, http://www.imdb.com/title/tt0454876/reviews?start=20 and so on. That's why I am incrementing by 10 in each loop, in order to fetch all the reviews. Can you please guide me on how I should fix my execution order? – Kartik Solanki Apr 18 '16 at 15:23
  • Also, I have initialized the macro in the Context tab with macro name 'pagePos' and value '0'. Can you tell me what the execution order inside the loop should be? Also, what should the crawling rule be, as I need to fetch just the reviews? I am just a beginner in RapidMiner, so please help me. – Kartik Solanki Apr 18 '16 at 15:44
  • The current process gives 403 errors. The reason is likely to be incorrect usage of `Crawl Web` in a tight loop accessing a URL directly. The process can be simplified to avoid using the `Loop` operator at all. I've updated my answer. – Andrew Chisholm Apr 18 '16 at 21:38