
Is it possible to detect whether a URL is a 404 or a 301 in crawler4j?

import java.util.List;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

// inside my WebCrawler subclass
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    System.out.println("URL: " + url);

    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        List<WebURL> links = htmlParseData.getOutgoingUrls();

        System.out.println("Text length: " + text.length());
        System.out.println("Html length: " + html.length());
        System.out.println("Number of outgoing links: " + links.size());
    }
}

I use this code in my crawler. Can anyone tell me how?


1 Answer


As of crawler4j version 3.3 (released February 2012), crawler4j supports handling HTTP status codes for fetched pages.

See the StatusHandlerCrawlerExample for how to do this.
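
A minimal sketch of that approach, assuming crawler4j 3.3+ where WebCrawler exposes the handlePageStatusCode hook; the class name StatusHandlerCrawler and the println logging are my own illustration:

import org.apache.http.HttpStatus;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class StatusHandlerCrawler extends WebCrawler {

    // Called for every fetched URL, before visit(), with the HTTP status code
    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
        if (statusCode == HttpStatus.SC_NOT_FOUND) {                // 404
            System.out.println("Not found: " + webUrl.getURL());
        } else if (statusCode == HttpStatus.SC_MOVED_PERMANENTLY) { // 301
            System.out.println("Moved permanently: " + webUrl.getURL());
        }
    }
}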

You can also parse pages with Jsoup (a Java HTML parser offering the best of DOM, CSS, and jQuery-like methods). There is also an example showing how to download a page from a given URL and get its status code. I think you should use crawler4j for crawling and Jsoup for page fetching.
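
A sketch of the Jsoup side (the URL is a placeholder; ignoreHttpErrors(true) keeps Jsoup from throwing on 4xx/5xx responses so you can inspect the code yourself, and followRedirects(false) makes a 301 visible instead of being followed silently):

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class PageStatusCheck {
    public static void main(String[] args) throws Exception {
        Connection.Response response = Jsoup.connect("http://example.com/some-page")
                .ignoreHttpErrors(true)   // don't throw HttpStatusException on 4xx/5xx
                .followRedirects(false)   // report the 301 rather than following it
                .execute();

        // e.g. "200 OK", "301 Moved Permanently", "404 Not Found"
        System.out.println(response.statusCode() + " " + response.statusMessage());
    }
}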
