I am crawling websites using crawler4j. I use jsoup to extract the content and save it to a text file, and then I use OmegaT to count the words in those text files.
My problem is with the text extraction step. I am using the following function to extract the text from the HTML:
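    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.safety.Whitelist;

    public static String cleanTagPerservingLineBreaks(String html) {
        String result = "";
        if (html == null)
            return html;
        Document document = Jsoup.parse(html);
        // Turn off pretty-printing so jsoup does not add its own line breaks
        document.outputSettings(new Document.OutputSettings()
                .prettyPrint(false));
        // Mark line breaks with a literal "\n" placeholder that survives serialization
        document.select("br").append("\\n");
        document.select("p").prepend("\\n\\n");
        // Turn the placeholders into real newlines
        result = document.html().replaceAll("\\\\n", "\n");
        // Replace non-breaking-space entities with plain spaces
        result = result.replaceAll("&nbsp;", " ");
        result = result.trim();
        // Strip all remaining tags
        result = Jsoup.clean(result, "", Whitelist.none(),
                new Document.OutputSettings().prettyPrint(false));
        return result;
    }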
In the line result = document.html().replaceAll("\\\\n", "\n"); when I use document.text() instead, I get well-formatted text with appropriate spaces, but when I do the word count in OmegaT, the unique words are not reported correctly. If I keep using document.html(), I get a proper word count, but there are no spaces between some pieces of text (e.g. WomenNew ArrivalsTops & BlousesPants & DenimDresses & SkirtsMenView All MenNew), and tags like strong and em are not removed by jsoup.
Is there a way to put spaces between all the tags and strip the content properly? If possible, I would also like an explanation of why the word count fluctuates.
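
To make the merged-word symptom concrete, here is a minimal stand-alone sketch. It does not use jsoup: it strips tags with a naive regex, which is only an assumption-laden stand-in for real HTML parsing, and the class and method names (TagSpacingDemo, stripTags, uniqueWords) are hypothetical. It shows that removing tags without putting whitespace in their place glues neighbouring words together, which by itself changes the unique-word count. It is not a claim about what jsoup or OmegaT do internally.

```java
import java.util.Set;
import java.util.TreeSet;

final class TagSpacingDemo {

    // Naive regex-based tag stripper, for illustration only (not jsoup's algorithm).
    // When padWithSpaces is true, every tag is replaced by a space instead of nothing.
    static String stripTags(String html, boolean padWithSpaces) {
        String replaced = html.replaceAll("<[^>]+>", padWithSpaces ? " " : "");
        // Collapse runs of whitespace so the padded output stays readable.
        return replaced.replaceAll("\\s+", " ").trim();
    }

    // Splits on whitespace and collects the distinct tokens.
    static Set<String> uniqueWords(String text) {
        Set<String> words = new TreeSet<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<ul><li>Women</li><li>New Arrivals</li><li>Tops</li></ul>";

        String glued = stripTags(html, false);
        String spaced = stripTags(html, true);

        System.out.println(glued + " -> " + uniqueWords(glued).size() + " unique words");
        System.out.println(spaced + " -> " + uniqueWords(spaced).size() + " unique words");
    }
}
```

With the sample list markup, the unpadded variant yields the tokens WomenNew and ArrivalsTops, while the padded variant yields Women, New, Arrivals and Tops. The same idea could presumably be applied in the jsoup version by inserting a space around each element before stripping, though that is untested here.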