
Recently, I had to crawl some websites with the open-source project crawler4j. However, crawler4j doesn't seem to offer any API for this. Now I have the problem of how to parse the HTML using the functions and classes provided by crawler4j and find elements the way we do with jQuery.

mly
    Could you not combine crawler4j with JSoup? – Ben Dale Sep 05 '13 at 14:36
  • Can you give me more tips on how to combine crawler4j with JSoup? As far as I can tell, crawler4j doesn't have anything like the Document class that Jsoup provides. – mly Sep 06 '13 at 04:56
  • 1
    I've not used crawler4j but is it possible to return the page's html? If you can get the html, or even just the URL, you can set up JSoup so that it points to the URL and you can start parsing. – Ben Dale Sep 06 '13 at 08:04
  • I am doing exactly this for my project. After reading your comment, I got confirmation that my approach is right. Thanks – clever_bassi Jul 02 '14 at 17:23
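
If you can get at the URL, Jsoup can fetch and parse the page itself, as Ben Dale's comment suggests. A minimal, self-contained sketch (the URL is a placeholder, and Jsoup is assumed to be on the classpath):

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupFetchSketch {
        public static void main(String[] args) throws IOException {
            // Placeholder URL; Jsoup downloads and parses the page itself
            Document doc = Jsoup.connect("http://example.com/").get();
            // CSS selectors behave much like jQuery selectors
            System.out.println(doc.select("title").text());
        }
    }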

1 Answer


It's relatively simple. The following approach worked for me.

In MyCrawler.java:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
...
@Override
public void visit(Page page) {
    ...
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String html = htmlParseData.getHtml();
        // Hand the raw HTML that crawler4j fetched over to Jsoup
        Document doc = Jsoup.parseBodyFragment(html);
        ...
    }
}
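
From the Document you can then select elements with CSS selectors, which is the closest Jsoup gets to jQuery-style lookups. A brief sketch, assuming we are still inside the if block above (the selectors are illustrative, not from the original answer):

    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    ...
    // Illustrative jQuery-like selections on the parsed document
    Elements links = doc.select("a[href]");           // all anchors with an href
    for (Element link : links) {
        System.out.println(link.attr("href") + " : " + link.text());
    }
    Element firstParagraph = doc.select("p").first(); // null if the page has no <p>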
vigneshwerv