I am crawling websites using crawler4j. I use jsoup to extract the content and save it to plain text files, and then I use OmegaT to count the words in those files.

The problem I am having is with the text extraction. I am using the following function to extract the text from the HTML.

public static String cleanTagPerservingLineBreaks(String html) {
    String result = "";
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);

    document.outputSettings(new Document.OutputSettings()
            .prettyPrint(false));
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    result = document.html().replaceAll("\\\\n", "\n");
    result = result.replaceAll("&nbsp;", " ");
    result = result.trim();
    result = Jsoup.clean(result, "", Whitelist.none(),
            new Document.OutputSettings().prettyPrint(false));
    return result;
}

If I replace document.html() with document.text() in the line result = document.html().replaceAll("\\\\n", "\n"); I get well-formatted text with appropriate spaces, but when I run the word count in OmegaT, the unique words are not shown properly. If I keep document.html(), I get a proper word count, but there are no spaces between some of the text (e.g. WomenNew ArrivalsTops & BlousesPants & DenimDresses & SkirtsMenView All MenNew), and tags like strong and em are not removed by jsoup.
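To illustrate how missing spaces could shift the counts, here is a minimal sketch that counts total and unique words for the fused sample above and for a version with spaces restored. It assumes words are separated by whitespace only, which may not match how OmegaT tokenizes. The class WordCountDemo, its helper methods, and the spaced string (my guess at where the word boundaries were) are hypothetical and not part of my crawler.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class WordCountDemo {

    // Splits on runs of whitespace. This is an assumption for illustration;
    // OmegaT may apply different tokenization rules.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Total number of whitespace-separated tokens.
    static int countWords(String text) {
        return tokenize(text).length;
    }

    // Number of distinct tokens (case-sensitive).
    static int countUniqueWords(String text) {
        Set<String> unique = new LinkedHashSet<>(Arrays.asList(tokenize(text)));
        return unique.size();
    }

    public static void main(String[] args) {
        // Text as it comes out when adjacent elements are joined without a separator.
        String fused = "WomenNew ArrivalsTops & BlousesPants & DenimDresses & SkirtsMenView All MenNew";
        // Hypothetical version of the same text with the element boundaries restored.
        String spaced = "Women New Arrivals Tops & Blouses Pants & Denim Dresses & Skirts Men View All Men New";

        System.out.println("fused:  words=" + countWords(fused) + ", unique=" + countUniqueWords(fused));
        System.out.println("spaced: words=" + countWords(spaced) + ", unique=" + countUniqueWords(spaced));
    }
}
```

With this kind of tokenization, fused strings such as WomenNew count as one new word instead of two existing ones, so both the total and the unique counts change depending on whether the extraction step inserts separators.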

Is there a way to put spaces between the text of all the tags and strip the markup properly? If possible, I would also like an explanation of why the word count fluctuates.

  • What do you mean by saying that when you use `document.text()` it doesn't show properly word counts in omegaT? As I see the resulting `String` is correct – Maciej Dobrowolski Mar 15 '16 at 20:00
  • Yes you'll need to shed more light on "words are not shown properly". `document.text()` works. In fact all the hard work (work-around) you want to put into fixing "spaces between all the tags and properly strip content" is already done by the `document.text()` call. – DaddyMoe Jan 09 '17 at 14:07
