I need help figuring out how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all. I want to go through each port, extract its name and coordinates, and write them to a file. The main class looks like this:
import java.io.FileWriter;
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
public class WorldPortSourceCrawler {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "data";
        int numberOfCrawlers = 5;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setMaxDepthOfCrawling(2);
        config.setUserAgentString("Sorry for any inconvenience, I am trying to keep the traffic low per second");
        //config.setPolitenessDelay(20);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages
         */
        controller.addSeed("http://www.marinetraffic.com/en/ais/index/ports/all");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(PortExtractor.class, numberOfCrawlers);

        System.out.println("finished reading");
        System.out.println("Ports: " + PortExtractor.portList.size());

        FileWriter writer = new FileWriter("PortInfo2.txt");
        System.out.println("Writing to file...");
        for (Port p : PortExtractor.portList) {
            writer.append(p.print() + "\n");
            writer.flush();
        }
        writer.close();
        System.out.println("File written");
    }
}
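For context, Port is just a small holder for the name and the coordinates with a print() method; stripped down, it is essentially this:

import java.io.Serializable;

public class Port implements Serializable {

    private final String name;
    private final String coordinates;

    public Port(String name, String coordinates) {
        this.name = name;
        this.coordinates = coordinates;
    }

    // One line per port for the output file
    public String print() {
        return name + "\t" + coordinates;
    }
}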
The PortExtractor looks like this:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PortExtractor extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(
        ".*(\\.(css|js|bmp|gif|jpe?g"
        + "|png|tiff?|mid|mp2|mp3|mp4"
        + "|wav|avi|mov|mpeg|ram|m4v|pdf"
        + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
    );

    public static List<Port> portList = new ArrayList<Port>();

    /**
     * Crawling logic
     */
    //@Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        //return !FILTERS.matcher(href).matches()&&href.startsWith("http://www.worldportsource.com/countries.php") && !href.contains("/shipping/") && !href.contains("/cruising/") && !href.contains("/Today's Port of Call/") && !href.contains("/cruising/") && !href.contains("/portcall/") && !href.contains("/localviews/") && !href.contains("/commerce/")&& !href.contains("/maps/") && !href.contains("/waterways/");
        return !FILTERS.matcher(href).matches() && href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }
}
How do I go about writing the HTML parser, and how can I tell the program not to crawl anything other than the port info links? I'm having difficulty with this: even though the code above runs, it breaks every time I try to add the HTML parsing. Any help would be much appreciated.
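To make it concrete, this is roughly what I imagine the two methods should end up looking like, using crawler4j's HtmlParseData to get at the raw HTML and Jsoup (an extra dependency) to parse it. The "/ais/details/ports/" URL pattern and the CSS selectors are pure guesses on my part, since I haven't worked out which links the listing page actually produces or which elements hold the name and coordinates:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class PortExtractor extends WebCrawler {

    // Synchronized because several crawler threads add to it concurrently
    public static List<Port> portList = Collections.synchronizedList(new ArrayList<Port>());

    // Only follow the paginated port listing and the individual port pages.
    // The "/ais/details/ports/" pattern is a guess: check the links the listing
    // page actually contains. (Depending on the crawler4j version, the required
    // signature may be shouldVisit(Page referringPage, WebURL url).)
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all")
                || href.contains("/ais/details/ports/");
    }

    @Override
    public void visit(Page page) {
        // Skip anything that was not parsed as HTML (images, binaries, etc.)
        if (!(page.getParseData() instanceof HtmlParseData)) {
            return;
        }
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        Document doc = Jsoup.parse(htmlParseData.getHtml());

        // Placeholder selectors: inspect the page source in a browser and
        // replace them with whatever elements really hold the name and coordinates.
        Element nameElement = doc.select("h1").first();
        Element coordElement = doc.select(".coordinates").first();
        if (nameElement != null && coordElement != null) {
            portList.add(new Port(nameElement.text(), coordElement.text()));
        }
    }
}

Does that look like the right general approach, or is there a better way to restrict the crawl and hook the parsing into visit()?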