
I need help figuring out how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all, go through each port, extract the port name and coordinates, and write them to a file. The main class looks as follows:

import java.io.FileWriter;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;


public class WorldPortSourceCrawler {

    public static void main(String[] args) throws Exception {
         String crawlStorageFolder = "data";
         int numberOfCrawlers = 5;

         CrawlConfig config = new CrawlConfig();
         config.setCrawlStorageFolder(crawlStorageFolder);
         config.setMaxDepthOfCrawling(2);
         config.setUserAgentString("Sorry for any inconvenience, I am trying to keep the traffic low per second");
         //config.setPolitenessDelay(20);
         /*
          * Instantiate the controller for this crawl.
          */
         PageFetcher pageFetcher = new PageFetcher(config);
         RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
         RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
         CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

         /*
          * For each crawl, you need to add some seed urls. These are the first
          * URLs that are fetched and then the crawler starts following links
          * which are found in these pages
          */
         controller.addSeed("http://www.marinetraffic.com/en/ais/index/ports/all");

         /*
          * Start the crawl. This is a blocking operation, meaning that your code
          * will reach the line after this only when crawling is finished.
          */
         controller.start(PortExtractor.class, numberOfCrawlers);    

         System.out.println("finished reading");
         System.out.println("Ports: " + PortExtractor.portList.size());
         FileWriter writer = new FileWriter("PortInfo2.txt");

         System.out.println("Writing to file...");
         for(Port p : PortExtractor.portList){
            writer.append(p.print() + "\n");
            writer.flush();
         }
         writer.close();
        System.out.println("File written");
        }
}

While the PortExtractor class looks like this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PortExtractor extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
        );

    public static List<Port> portList = new ArrayList<Port>();

    /**
     * Crawling logic
     */
    //@Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        //return  !FILTERS.matcher(href).matches()&&href.startsWith("http://www.worldportsource.com/countries.php") && !href.contains("/shipping/") && !href.contains("/cruising/") && !href.contains("/Today's Port of Call/") && !href.contains("/cruising/") && !href.contains("/portcall/") && !href.contains("/localviews/") && !href.contains("/commerce/")&& !href.contains("/maps/") && !href.contains("/waterways/");
        return !FILTERS.matcher(href).matches() && href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }

}

How do I go about writing the HTML parser, and how can I tell the program not to crawl anything other than the port info links? I'm having difficulty with this: even with the code running, it breaks every time I try to work with the HTML parsing. Any help would be much appreciated.

Almanz

1 Answer


The first task is to check the robots.txt of the site in order to see whether crawler4j is actually allowed to crawl this website. Investigating this file, we find that this is no problem:

User-agent: *
Allow: /
Disallow: /mob/
Disallow: /upload/
Disallow: /users/
Disallow: /wiki/

Second, we need to figure out which links are of particular interest for your purpose. This needs some manual investigation. I only checked a few entries of the link mentioned above, but I found that every port contains the keyword ports in its link, e.g.

http://www.marinetraffic.com/en/ais/index/ports/all/per_page:50
http://www.marinetraffic.com/en/ais/details/ports/18853/China_port:YANGZHOU
http://www.marinetraffic.com/en/ais/details/ports/793/Korea_port:BUSAN

With this information, we are able to modify the shouldVisit method in a whitelisting manner.

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {

    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.contains("www.marinetraffic.com")
            && href.contains("ports");
}

This is a very simple implementation, which could be enhanced by regular expressions, for example as sketched below.
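A possible regex-based variant of the whitelist (the pattern here is an assumption derived from the example URLs above, not something verified against the full site, so check it against the links you actually need):

// Hypothetical whitelist pattern based on the example port URLs above:
// it accepts the port index pages and the per-port detail pages only.
private final static Pattern PORT_LINKS = Pattern.compile(
        "^http://www\\.marinetraffic\\.com/en/ais/(index|details)/ports/.*$");

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && PORT_LINKS.matcher(href).matches();
}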

Third, we need to parse the data out of the HTML. The information you are looking for is contained in the following <div> section:

<div class="bg-info bg-light padding-10 radius-4 text-left">
    <div>
        <span>Latitude / Longitude: </span>
        <b>1.2593655° / 103.75445°</b>
        <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655" title="Show on Map"><img class="loaded" src="/img/icons/show_on_map_magnify.png" data-original="/img/icons/show_on_map_magnify.png" alt="Show on Map" title="Show on Map"></a>
        <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655/showports:1" title="Show on Map">Show on Map</a>
    </div>

    <div>
        <span>Local Time:</span>
                <b><time>2016-12-11 19:20</time>&nbsp;[UTC +8]</b>
    </div>

            <div>
            <span>Un/locode: </span>
            <b>SGSIN</b>
        </div>

            <div>
            <span>Vessels in Port: </span>
            <b><a href="/en/ais/index/ships/range/port_id:290/port_name:SINGAPORE">1021</a></b>
        </div>

            <div>
            <span>Expected Arrivals: </span>
            <b><a href="/en/ais/index/eta/all/port:290/portname:SINGAPORE">1059</a></b>
        </div>

</div>

Basically, I would use an HTML parser (e.g. Jericho) for this task. Then you are able to extract exactly the correct <div> section and obtain the attributes you are looking for.
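As a rough sketch (not a definitive implementation): the coordinates could be pulled out in the crawler's visit method via crawler4j's HtmlParseData and Jericho. The class name bg-info and the position of the first <b> element are taken from the HTML snippet above and may differ on the live page; the Port constructor and portList are assumptions about your own code.

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

// Inside PortExtractor:
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();

    // crawler4j only provides HTML for successfully parsed HTML pages.
    if (!(page.getParseData() instanceof HtmlParseData)) {
        return;
    }
    String html = ((HtmlParseData) page.getParseData()).getHtml();

    Source source = new Source(html);
    for (Element div : source.getAllElements("div")) {
        String cssClass = div.getAttributeValue("class");
        // Class value taken from the snippet above; verify it against the live page.
        if (cssClass != null && cssClass.contains("bg-info")) {
            // In the snippet, the first <b> inside this block holds "lat° / lon°".
            Element coords = div.getFirstElement("b");
            if (coords != null) {
                String latLon = coords.getTextExtractor().toString(); // e.g. "1.2593655° / 103.75445°"
                System.out.println(url + " -> " + latLon);
                // portList.add(new Port(...)); // assumed: build your Port object here
            }
        }
    }
}

Splitting latLon on the slash and stripping the degree signs would then give the two coordinate values for your Port object.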

rzo1
  • Thanks, the shouldVisit() is better tweaked than what I initially had; yet my crawler seems to only visit the page set by the seed and doesn't go any deeper after that. Any reason why this could be? – Almanz Dec 13 '16 at 18:46
  • You set maxDepth to 2. Check your crawlconfig again or update your question with the current one. – rzo1 Dec 13 '16 at 20:01
  • I updated the code in my question; as for crawling depth I left that untouched and just depended on the filtering to do its job. Yet, I still seem to have trouble, as the crawler only crawls the URL http://www.marinetraffic.com/en/ais/index/ports/all – Almanz Dec 13 '16 at 20:11
  • Try a higher value for config.setMaxDepthOfCrawling(2); or -1 for unlimited depth – rzo1 Dec 13 '16 at 20:18
  • I just tried that, still not working. It's odd, everything is set up properly. I sense it's a really lame issue but I just can't seem to figure it out! – Almanz Dec 13 '16 at 20:28
  • @Almanz You did not override shouldVisit in a correct way. For this reason it always returned true. I updated my answer. – rzo1 Dec 14 '16 at 14:51
  • Initially I had it written that way, but it's still not crawling. Could there be something in the crawler class that needs to be changed? What possible reasons can cause the crawler not to crawl? Could it be that I'm filtering out all possible links to visit? – Almanz Dec 14 '16 at 20:00
  • I'm starting to feel this site can't be crawled. This isn't the first website I have crawled with this code, yet it doesn't work on this website – Almanz Dec 14 '16 at 20:02
  • I tried the above code with the adaptation of the shouldVisit method and the latest GitHub version - it worked for me out of the box. Maybe you can upload your project to GitHub, so I can investigate the code more deeply. Another reason could be an IP ban by the site's system administrator – rzo1 Dec 14 '16 at 20:04
  • I wish I could, but it's company policy not to do that. I e-mailed the site about a possible blockage of my IP, and hope they will be swift with their reply. Because you are right, that may be what's going on. – Almanz Dec 14 '16 at 20:25
  • It could also be a problem with your company network. I tried it on our company network and at home - it just worked. Another strategy would be to log the output of the (overridden) shouldVisit with the boolean value, the link to visit and the URL the link was found on (a minimal sketch follows below). Then, check whether the extracted links comply with your needs, e.g. check the source manually, look for links and compare that with the log output. – rzo1 Dec 14 '16 at 20:36
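A minimal sketch of such a logging override, assuming the two-argument shouldVisit signature from the answer above; the log format is purely illustrative:

// Inside PortExtractor: log every decision so the filter can be compared with
// the links crawler4j actually discovers on each visited page.
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    boolean visit = !FILTERS.matcher(href).matches()
            && href.contains("www.marinetraffic.com")
            && href.contains("ports");
    String foundOn = (referringPage != null) ? referringPage.getWebURL().getURL() : "seed";
    System.out.println("shouldVisit=" + visit + " link=" + href + " foundOn=" + foundOn);
    return visit;
}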