There's something I'm not quite understanding about the use of boilerpipe's ArticleExtractor class. Albeit, I am also very new to java, so perhaps my basic knowledge of this enviornemnt is at fault.
anyhow, I'm trying to use boilerpipe to extract the main article from some raw html source I have collected. The html source text is stored in a java.lang.String variable (let's call it htmlstr) variable that has the raw HTML contents of a webpage.
I know how to run boilerpipe to print the extracted text to the output window as follows:
java.lang.String htmlstr = "<!DOCTYPE.... ****html source**** ... </html>";
java.lang.String article = ArticleExtractor.INSTANCE.getText(htmlstr);
System.out.println(article);
However, I'm not sure how to run BP by first instantiating an instance of the ArticleExtractor class, then calling it with the 'TextDocument' input datatype. The TextDocument datatype is itself somehow constructed from BP's 'TextBlock' datatype, and perhaps I am not doing this correctly...
What is the proper way to construct a TextDocument type variable from my htmlstr string variable?
So my problem is then in using the processing method of BP's Article Extractor class aside from calling the ArticleExtractor getText method as per the example above. In other words, I'm not sure how to use the
ArticleExtractor.process(TextDocument doc);
method.
It is my understanding that one is required to run this ArticleExtractor process method to then be able to use the same "TextDocument doc" variable for getting document stats, using BP's
TextDocumentStatistics(TextDocument doc, boolean contentOnly)
method? I would like to use the stats to determine how good the filtering was estimated to be.
Any code examples someone could help me out with?