
Recently, I had to crawl some websites with the open-source project crawler4j. However, crawler4j doesn't seem to offer any API for this. Now I have the problem of how to parse the HTML using the functions and classes provided by crawler4j and find elements the way we do with jQuery.

mly
    Could you not combine crawler4j with JSoup? – Ben Dale Sep 05 '13 at 14:36
  • Can you give me more tips on how to combine crawler4j with JSoup? As far as I can tell, crawler4j doesn't have anything like the Document class that Jsoup provides. – mly Sep 06 '13 at 04:56
  • 1
    I've not used crawler4j but is it possible to return the page's html? If you can get the html, or even just the URL, you can set up JSoup so that it points to the URL and you can start parsing. – Ben Dale Sep 06 '13 at 08:04
  • I am doing exactly this for my project. After reading your comment, I got confirmation that my approach is right. Thanks – clever_bassi Jul 02 '14 at 17:23
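
If you can get at the URL, Jsoup can fetch and parse the page itself, as Ben Dale's comment suggests. A minimal, self-contained sketch (the URL is a placeholder, and Jsoup is assumed to be on the classpath):

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupFetchSketch {
        public static void main(String[] args) throws IOException {
            // Placeholder URL; Jsoup downloads and parses the page itself
            Document doc = Jsoup.connect("http://example.com/").get();
            // CSS selectors behave much like jQuery selectors
            System.out.println(doc.select("title").text());
        }
    }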

1 Answer


It's relatively simple. The following approach worked for me.

In MyCrawler.java:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
...
@Override
public void visit(Page page) {
    ...
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String html = htmlParseData.getHtml();
        // Hand the raw HTML that crawler4j fetched over to Jsoup
        Document doc = Jsoup.parseBodyFragment(html);
        ...
    }
}
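
From the Document you can then select elements with CSS selectors, which is the closest Jsoup gets to jQuery-style lookups. A brief sketch, assuming we are still inside the if block above (the selectors are illustrative, not from the original answer):

    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    ...
    // Illustrative jQuery-like selections on the parsed document
    Elements links = doc.select("a[href]");           // all anchors with an href
    for (Element link : links) {
        System.out.println(link.attr("href") + " : " + link.text());
    }
    Element firstParagraph = doc.select("p").first(); // null if the page has no <p>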
vigneshwerv