Extract the main part of a page in Java

Question

Hello, I have a Wikipedia page about a person, and I want to use Java to extract the HTML source of the main part of that page.

Do you have any ideas?

Specifically about Wikipedia: There's an API. If you don't want to use that, you should at least call the page [like this](http://meta.wikimedia.org/w/index.php?action=render&title=List_of_Wikipedias) to reduce the transfer size. — thirtydot, Mar 09 '11 at 18:43

score 2 · Answer 1 · answered Mar 21 '11 at 22:00

Use Jsoup, specifically the selector syntax.

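import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;

//fetch and parse the page, with a 10 second timeout
Document doc = Jsoup.parse(new URL("http://en.wikipedia.org/"), 10000);
//"div.interestingClass" is a placeholder; use the selector for the part you want
Elements interestingParts = doc.select("div.interestingClass");

//get the combined HTML fragments as a String
String selectedHtmlAsString = interestingParts.html();

//get all the links
Elements links = interestingParts.select("a[href]");

//filter the document to include certain tags only
//(newer jsoup releases call this class Safelist instead of Whitelist)
Whitelist allowedTags = Whitelist.simpleText().addTags("blockquote","code", "p");
Cleaner cleaner = new Cleaner(allowedTags);
Document filteredDoc = cleaner.clean(doc);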

It's a very useful API for parsing HTML pages and extracting the desired data.

score 1 · Answer 2 · answered Mar 09 '11 at 18:46

For Wikipedia there is an API: http://www.mediawiki.org/wiki/API:Main_page
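As a rough sketch of what calling it could look like, the helper below only builds the request URLs; it does not fetch anything. The class and method names (WikiUrls, apiParseUrl, renderUrl) are made up for illustration, and it assumes the English Wikipedia host. The first URL uses the API's action=parse module, the second the action=render form mentioned in the comment on the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds request URLs only, no network access.
final class WikiUrls {
    // Assumption: English Wikipedia; swap the host for another wiki.
    private static final String HOST = "https://en.wikipedia.org";

    // URL for the API's parse module, asking for the page HTML wrapped in JSON.
    static String apiParseUrl(String pageTitle) {
        return HOST + "/w/api.php?action=parse&prop=text&format=json&page=" + encode(pageTitle);
    }

    // URL that returns just the rendered article body, without the site chrome.
    static String renderUrl(String pageTitle) {
        return HOST + "/w/index.php?action=render&title=" + encode(pageTitle);
    }

    // MediaWiki titles use underscores for spaces; everything else is percent-encoded.
    private static String encode(String pageTitle) {
        try {
            return URLEncoder.encode(pageTitle.replace(' ', '_'), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM
            throw new IllegalStateException(e);
        }
    }
}
```

Either URL can then be opened with java.net.URL or handed to an HTML parser. The render form is probably the simpler choice when only the article HTML is needed.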

– ilalex, Mar 09 '11 at 18:46

score 0 · Answer 3 · answered Mar 09 '11 at 18:40

1. Analyze the web page's structure.
2. Use jsoup to parse the HTML.

– jmj, Mar 09 '11 at 18:40

It's probably not a legal thing to do. – jmj, Mar 09 '11 at 18:42

score 0 · Answer 4 · answered Mar 09 '11 at 18:43

Note that this returns the HTML source as a single String (a blob of sorts), not a nicely formatted content item.

I use this myself; it's a little snippet I keep around for whatever I need. Pass in the URL and either the start and stop text, or set the boolean to get everything.

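import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Fetches a page as one String. With getAll, every line is kept; otherwise only
// the lines from the first one containing startText through the first one
// containing stopText are kept.
public static String getPage(
      String url,
      String startText,
      String stopText,
      boolean getAll) throws Exception {
    StringBuilder page = new StringBuilder();
    URLConnection conn = new URL(url).openConnection();
    // try-with-resources closes the stream even if reading fails;
    // UTF-8 is assumed here, which is what Wikipedia serves
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String inputLine;
      boolean save = getAll;
      while ((inputLine = in.readLine()) != null) {
        if (!save && inputLine.contains(startText))
          save = true;
        if (save)
          page.append(inputLine).append('\n'); // readLine() strips the line break
        if (!getAll && save && inputLine.contains(stopText))
          break;
      }
    }
    return page.toString();
}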
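
The start/stop filtering can be tried without a network connection by feeding it any Reader. The following is a minimal sketch of that idea, not the snippet above itself: the class and method names (MarkerExtract, between) are invented for illustration, and it keeps line breaks between the lines it collects.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper mirroring the start/stop logic on an arbitrary Reader.
final class MarkerExtract {
    // Returns the lines from the first one containing startText through the
    // first line at or after it that contains stopText, each followed by '\n'.
    // Returns an empty String if startText never appears.
    static String between(Reader source, String startText, String stopText) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            boolean save = false;
            String line;
            while ((line = in.readLine()) != null) {
                if (!save && line.contains(startText)) {
                    save = true; // start marker found, begin collecting
                }
                if (save) {
                    out.append(line).append('\n');
                    if (line.contains(stopText)) {
                        break; // stop marker found, this line is the last one kept
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For example, passing a java.io.StringReader over some saved HTML with a start text of id="content" and a stop text of </div> returns the opening div line, everything after it, and the first closing div line. Because the match is line-based and knows nothing about nesting, it stops at the first </div> it sees, which is one reason an HTML parser tends to be more reliable for real pages.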
