How to run boilerpipe's ArticleExtractor and get document stats?

Question

There's something I'm not quite understanding about the use of boilerpipe's ArticleExtractor class. I am also very new to Java, so perhaps my basic knowledge of this environment is at fault.

Anyhow, I'm trying to use boilerpipe to extract the main article from some raw HTML source I have collected. The raw HTML contents of a webpage are stored in a java.lang.String variable (let's call it htmlstr).

I know how to run boilerpipe to print the extracted text to the output window as follows:

java.lang.String htmlstr = "<!DOCTYPE.... ****html source**** ... </html>";

java.lang.String article = ArticleExtractor.INSTANCE.getText(htmlstr);
System.out.println(article);

However, I'm not sure how to run BP by first instantiating an instance of the ArticleExtractor class, then calling it with the 'TextDocument' input datatype. The TextDocument datatype is itself somehow constructed from BP's 'TextBlock' datatype, and perhaps I am not doing this correctly...

What is the proper way to construct a TextDocument type variable from my htmlstr string variable?

So my problem is how to use the processing method of BP's ArticleExtractor class, rather than calling the ArticleExtractor getText method as in the example above. In other words, I'm not sure how to use the

ArticleExtractor.process(TextDocument doc);

method.

It is my understanding that one must run this ArticleExtractor process method first, and can then pass the same "TextDocument doc" variable to BP's

TextDocumentStatistics(TextDocument doc, boolean contentOnly)

constructor to get the document stats. Is that right? I would like to use the stats to judge how good the filtering was estimated to be.

Any code examples someone could help me out with?

Retagging. This is [tag:web-scraping] (= data extraction from web pages), not [tag:data-mining] (= complex statistical analysis) — Has QUIT--Anony-Mousse, Jun 26 '12 at 06:08

score 1 · Accepted Answer · answered Jun 28 '12 at 06:59

The code below is written in Jython (conversion to Java should be easy).

1) How to get TextDocument from a HTML String:

import org.xml.sax.InputSource as InputSource
import de.l3s.boilerpipe.sax.HTMLDocument as HTMLDocument
import de.l3s.boilerpipe.document.TextDocument as TextDocument
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput as BoilerpipeSAXInput
import de.l3s.boilerpipe.extractors.ArticleExtractor as ArticleExtractor
import de.l3s.boilerpipe.estimators.SimpleEstimator as SimpleEstimator
import de.l3s.boilerpipe.document.TextDocumentStatistics as TextDocumentStatistics
import de.l3s.boilerpipe.document.TextBlock as TextBlock

# rawHtmlString holds the raw HTML of the page (htmlstr in the question)
htmlDoc = HTMLDocument(rawHtmlString)
inputSource = htmlDoc.toInputSource()
boilerpipeSaxInput = BoilerpipeSAXInput(inputSource)
textDocument = boilerpipeSaxInput.getTextDocument()

2) How to process TextDocument using Article Extractor (continued from above)

# getText runs the extractor on textDocument and returns the extracted article text
content = ArticleExtractor.INSTANCE.getText(textDocument)

3) How to get TextDocumentStatistics (continued from above)

content_list = []  # in Java, use an ArrayList instead of a Python list
content_list.append(TextBlock(content))  # in Java: arrayList.add(new TextBlock(content))
content_td = TextDocument(content_list)
content_stats = TextDocumentStatistics(content_td, True)  # True: statistics for article content only

Note: The Javadoc that accompanies the boilerpipe 1.2.jar library should be somewhat useful for future reference.
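
A possible pitfall with step 3, stated as an assumption to check against the Javadoc: as far as I can tell, TextDocumentStatistics adds up the word count stored on each TextBlock and, when contentOnly is true, skips blocks that are not flagged as content. A TextBlock built from a bare string would then carry neither a word count nor the content flag, so the statistics could come out as 0. The sketch below models that behaviour in plain Java. Block, countWords and contentRatio are hypothetical stand-ins written for illustration only; none of them is a boilerpipe class or method.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of block-based statistics. Nothing here is boilerpipe API:
// Block, countWords and contentRatio are hypothetical stand-ins.
class StatsSketch {

    // Stand-in for a text block: its text, a stored word count, and a content flag.
    static final class Block {
        final String text;
        final int numWords;
        final boolean isContent;

        Block(String text, int numWords, boolean isContent) {
            this.text = text;
            this.numWords = numWords;
            this.isContent = isContent;
        }

        // Models wrapping a bare string (assumed): no word count, not flagged as content.
        static Block fromBareString(String text) {
            return new Block(text, 0, false);
        }
    }

    // Sums the stored word counts, optionally skipping non-content blocks.
    static int countWords(List<Block> blocks, boolean contentOnly) {
        int total = 0;
        for (Block b : blocks) {
            if (contentOnly && !b.isContent) {
                continue;
            }
            total += b.numWords;
        }
        return total;
    }

    // Share of all words that ended up in content blocks; 0.0 for an empty document.
    static double contentRatio(List<Block> blocks) {
        int all = countWords(blocks, false);
        return all == 0 ? 0.0 : (double) countWords(blocks, true) / all;
    }

    public static void main(String[] args) {
        // A "processed" document: the extractor has flagged one block as content.
        List<Block> processed = new ArrayList<>();
        processed.add(new Block("Home About Contact", 3, false));
        processed.add(new Block("The main article text goes here", 6, true));

        // The extracted text re-wrapped in a fresh block, as in step 3.
        List<Block> rewrapped = new ArrayList<>();
        rewrapped.add(Block.fromBareString("The main article text goes here"));

        System.out.println(countWords(processed, true));   // 6
        System.out.println(countWords(processed, false));  // 9
        System.out.println(countWords(rewrapped, true));   // 0
        System.out.println(contentRatio(processed));       // fraction of words kept as content
    }
}
```

If the assumption holds, the boilerpipe equivalent would presumably be to build TextDocumentStatistics(textDocument, True) directly on the textDocument from step 2 after the extractor has run, and to compare it with TextDocumentStatistics(textDocument, False) to see how much of the page was kept.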

Thanks Kevin, I am just not familiar enough with Java syntax to correctly implement what I was reading in the Javadoc. I'm actually calling the Java through MATLAB... got it working now, except that the document stats always return 0 from the getNumWords method. — brneuro, Jul 06 '12 at 13:50
