I am building a small Java application that fetches five Wikipedia pages and searches for substrings in the HTML source. I am using the library org.apache.commons.lang3.StringUtils. However, a Wikipedia article can be large, and there seems to be a limitation in StringUtils:

String html;

try {
    html = Jsoup.connect("http://en.wikipedia.org/wiki/Canada").get().html();
} catch(IOException e) {
    html = "";
}

String trimmedHtml = StringUtils.substringBetween(html, "<html>", "</html>");

System.out.println(html); // prints the whole source code fine
System.out.println(trimmedHtml); // prints null

Why does the console print null for trimmedHtml? The output should be (almost) as big as for html. Is there a maximum length for the string output or for the parameters of substringBetween()?

user2864740
trakmack

1 Answer

The StringUtils methods work and are well tested; there is no "limitation" or "bug" here, and no maximum length is involved.

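Viewing the page source reveals that the literal string <html> never occurs in it, because the opening tag carries attributes. With no match for the opening delimiter, substringBetween() returns null: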

<html lang="en" dir="ltr" class="client-nojs">
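The mismatch can be reproduced without fetching anything. The following is a minimal sketch using only the standard library; the class name `TagMatchDemo`, the helper `hasBareHtmlTag`, and the shortened sample string are made up for illustration and are not part of the original code.

```java
public class TagMatchDemo {
    // Shortened stand-in for the page source (illustrative only);
    // the opening tag has the same shape as the real one.
    static final String SAMPLE =
        "<html lang=\"en\" dir=\"ltr\" class=\"client-nojs\"><body>Hi</body></html>";

    // True only if the exact character sequence "<html>" occurs in the text.
    static boolean hasBareHtmlTag(String text) {
        return text.contains("<html>");
    }

    public static void main(String[] args) {
        // The opening delimiter is missing, so a delimiter-based search has nothing to start from.
        System.out.println(hasBareHtmlTag(SAMPLE));
        // The tag is there, just not in the exact form that was searched for.
        System.out.println(SAMPLE.contains("<html"));
        // The closing delimiter is present as written.
        System.out.println(SAMPLE.contains("</html>"));
    }
}
```

The size of the input plays no role here: the same result occurs with a string of a few dozen characters.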

This is a good example of why string processing of HTML is generally a bad idea. Keep using the support offered by Jsoup instead, for instance by obtaining the <html> element and calling its html() method.
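One way the Jsoup approach might look is sketched below. It assumes Jsoup is on the classpath and parses a small inline string rather than the live page; the class name `InnerHtmlDemo` and the helper `innerHtml` are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InnerHtmlDemo {
    // Returns the markup inside the root <html> element, or "" if none is found.
    static String innerHtml(String source) {
        Document doc = Jsoup.parse(source);
        // The selector matches the element by tag name, whatever attributes it carries.
        Element root = doc.select("html").first();
        return root == null ? "" : root.html();
    }

    public static void main(String[] args) {
        String source =
            "<html lang=\"en\" dir=\"ltr\"><body><p>Hello</p></body></html>";
        System.out.println(innerHtml(source));
    }
}
```

For the code in the question, the same idea would apply to the Document returned by get(), in place of the string-based substringBetween() call. Note that Jsoup normalizes and pretty-prints the markup, so the result is not guaranteed to be byte-for-byte identical to the original source.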

user2864740