Can RapidMiner extract XPaths from a list of URLs, instead of first saving the HTML pages?

Question

I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want it to scrape XPath matches from a URL list I've generated with another program (which has more options than the 'Crawl Web' operator in RapidMiner).

I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I'm trying to scrape have thousands of pages, and I don't want to store them all on my PC. The web crawler also lacks critical features, so I'm unable to use it for my purposes. Is there a way to make it just read the URLs and scrape the XPath matches from each of them?

I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even how to install them), since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect the operators in the right order.

I need some input to keep the motivation going. I would like to know what operator I could use instead of 'Process Documents from Files'. I've looked at 'Process Documents from Web', but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.

Looking forward to your replies.

NaMarPi · Answer 1 · 2013-04-10T20:36:14.337

Web scraping with RapidMiner without saving the HTML pages locally is a two-step process:

Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with the following difference:

Instead of the Crawl Web operator, use the Process Documents from Web operator. There will be no option to specify an output directory, because the results are loaded into the ExampleSet.

The ExampleSet will contain the links matching the crawling rules.

[Image: Process Documents from Web, main process]

Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40, with the following difference:

Put the Extract Information subprocess inside the Process Documents from Web operator created in Step 1.

The ExampleSet will contain the links and the attributes matching the XPath queries.

[Image: Extract Information, subprocess]
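For readers who want to see the same data flow outside RapidMiner, here is a minimal Python sketch of the idea: read each URL from a list, parse the page in memory, apply the queries, and collect the matches in a table without writing any HTML to disk. This is not how RapidMiner implements it; the names `TreeParser`, `fetch`, and `scrape` are made up for illustration. It uses only the standard library, whose ElementTree supports just a subset of XPath, so queries must be relative (start with `.//`). A full XPath engine such as lxml would be needed for more complex expressions.

```python
from html.parser import HTMLParser
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# HTML tags that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeParser(HTMLParser):
    """Hypothetical helper: builds an ElementTree from (possibly sloppy) HTML."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        el = self.stack[-1]
        if len(el):  # text that follows a child element is that child's tail
            el[-1].tail = (el[-1].tail or "") + data
        else:
            el.text = (el.text or "") + data


def fetch(url):
    """Download one page into memory (nothing is saved to disk)."""
    with urlopen(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def scrape(urls, queries, fetch_html=fetch):
    """Return one row per URL: the URL plus the text of every match per query."""
    rows = []
    for url in urls:
        parser = TreeParser()
        parser.feed(fetch_html(url))
        parser.close()
        row = {"url": url}
        for name, path in queries.items():
            row[name] = ["".join(el.itertext()).strip()
                         for el in parser.root.findall(path)]
        rows.append(row)
    return rows
```

Assuming the URL list is in a text file with one URL per line, it could be used as `scrape(open("urls.txt").read().split(), {"title": ".//h1"})`; the file name and the query are placeholders. Passing a different `fetch_html` function makes it possible to try the queries on a few in-memory pages before running them against thousands of URLs.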

Because the request to revert this answer from Community Wiki was declined for lack of evidence, there is no need to upvote it. The author of the answer will not get reputation for it. — NaMarPi, Apr 10 '13 at 20:41

miodf · Answer 2 · 2012-05-02T07:55:27.873


I have much the same problem as you, and maybe these posts from RapidMiner's forum will help you a little: http://rapid-i.com/rapidforum/index.php/topic,2753.0.html and http://rapid-i.com/rapidforum/index.php?topic=3851.0.html

See ya ;)

edited May 02 '12 at 07:55

answered May 02 '12 at 06:46


Please include in this answer the relevant parts of the links you post. The answer should be self-readable :) (link rot, etc.) – Nikana Reklawyks Oct 20 '12 at 00:20
