substringBetween() returns null when trying to extract ..

Question

I am building a small Java application to fetch five Wikipedia pages and find substrings in the HTML source code. I am using the library class org.apache.commons.lang3.StringUtils. However, a Wikipedia article can be big, and there seems to be a limitation in StringUtils:

String html;

try {
    html = Jsoup.connect("http://en.wikipedia.org/wiki/Canada").get().html();
} catch(IOException e) {
    html = "";
}

String trimmedHtml = substringBetween(html, "<html>", "</html>");

System.out.println(html); // prints the whole source code fine
System.out.println(trimmedHtml); // prints null

Why does the console print null for trimmedHtml? The output should be (almost) as big as for html. Is there a maximum length for the string output or for the parameters of substringBetween()?

user2864740 · Accepted Answer · 2014-08-15T03:10:31.667

4

The string util methods work and are well tested - there is no "limitation" or "bug" here.

Viewing the page source reveals that the literal <html> will not match, because the opening tag carries attributes:

<html lang="en" dir="ltr" class="client-nojs">
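To see the mismatch without any dependencies, here is a minimal sketch. The between() helper is a hypothetical stand-in that mirrors the documented behaviour of StringUtils.substringBetween() (null when either delimiter is not found) using plain indexOf; the page string is a made-up, shortened document, not the real Wikipedia source.

```java
public class SubstringBetweenDemo {

    // Hypothetical stand-in for StringUtils.substringBetween(), written with
    // plain indexOf so the sketch runs without commons-lang3.
    static String between(String str, String open, String close) {
        if (str == null || open == null || close == null) {
            return null;
        }
        int start = str.indexOf(open);
        if (start == -1) {
            return null; // opening delimiter not found
        }
        int end = str.indexOf(close, start + open.length());
        if (end == -1) {
            return null; // closing delimiter not found
        }
        return str.substring(start + open.length(), end);
    }

    public static void main(String[] args) {
        // Shortened, made-up page whose <html> tag carries attributes.
        String page = "<!DOCTYPE html>\n"
                + "<html lang=\"en\" dir=\"ltr\" class=\"client-nojs\">\n"
                + "<head><title>Canada</title></head>\n"
                + "<body><p>Hello</p></body>\n"
                + "</html>";

        // The exact text "<html>" never occurs, so the result is null.
        System.out.println(between(page, "<html>", "</html>"));

        // A tag without attributes matches literally.
        System.out.println(between(page, "<title>", "</title>"));
    }
}
```

The size of the input plays no role here: the first call fails only because the seven characters "<html>" never appear in the page.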

This is a good example of why string processing of HTML is generally not a good idea. Keep using the support offered by Jsoup instead, for example by obtaining the <html> element and calling its html() method.
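A rough sketch of the Jsoup route, assuming jsoup is on the classpath. The class name, the innerHtmlOfRoot() helper and the sample markup are made up for illustration; with a live page you would already have a Document from Jsoup.connect(...).get() and could skip the parse step.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupHtmlElementDemo {

    // Returns the markup inside the root <html> element, regardless of
    // which attributes the opening tag carries.
    static String innerHtmlOfRoot(String html) {
        Document doc = Jsoup.parse(html);
        // The selector matches on the element name, so attributes such as
        // lang or class do not get in the way.
        Element root = doc.select("html").first();
        return root == null ? "" : root.html();
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for the fetched page.
        String page = "<html lang=\"en\" dir=\"ltr\" class=\"client-nojs\">"
                + "<head><title>Canada</title></head>"
                + "<body><p>Hello</p></body>"
                + "</html>";

        System.out.println(innerHtmlOfRoot(page));
    }
}
```

Note that Jsoup normalises and re-serialises the markup, so the result is not guaranteed to be byte-for-byte identical to the downloaded source.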

edited Aug 15 '14 at 03:10

answered Aug 15 '14 at 03:04
