How to retain the original HTML formatting when extracting content from web pages with boilerpipe?

Question

I can extract the title and content (split into paragraphs) from web pages in my Android application, but fetching images sometimes fails.

However, I could not find a way to retain the HTML formatting (e.g. bold, hyperlinks, underline, font size) in the extractor.

That is, if a sentence in the web page is bold, hyperlinked, or underlined, how can I extract both the sentence itself and its formatting?

I tried this page (an article) with both the Web API and the APIs in the local jar.

I would like to get the same result with the local APIs as the Web API produces.

Could anyone share their experience with this issue?

Many thanks,

James

Edit #1

Here is the code:

signalUpdate(STATE.Start);

// fetching the raw HTML
htmlDoc = HTMLFetcher.fetch(new URL(url));

// parsing the HTML into a boilerpipe TextDocument
doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
extraction.setTitle(doc.getTitle());        // obtaining title

ArticleExtractor.INSTANCE.process(doc);     // obtaining content
SplitParagraphBlocksFilter.INSTANCE.process(doc);

contentBuilder.setLength(0);

for(TextBlock block : doc.getTextBlocks()) {
    blockString = "<p>" + block.getText() + "</p>";
    contentBuilder.append(blockString);
}

extraction.setContent(contentBuilder.toString());

// obtaining image
extractor = CommonExtractors.ARTICLE_EXTRACTOR;
ie = ImageExtractor.INSTANCE;
imgUrls = ie.process(new URL(url), extractor);
extraction.setImgUrls(imgUrls);

//
signalUpdate(STATE.Complete);

To clarify what I mean by "fail":

I can fetch images from some websites. However, I could not get the image in the article mentioned above.
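
On the formatting part of the question, one possible direction (not boilerpipe's own mechanism) is to start from an HTML fragment of the content region and drop every tag except a small whitelist of inline ones, so that bold, underline, and hyperlinks survive. The sketch below is a minimal illustration using only the Java standard library. The class name `InlineFormatFilter`, the method `keepInlineFormatting`, and the whitelist are hypothetical, and it assumes you already have the HTML for the region: `TextBlock.getText()` returns plain text, so the markup is already gone from its output. A regex-based filter is only an approximation; a real HTML parser would be more robust.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of boilerpipe.
class InlineFormatFilter {
    // Assumed whitelist of inline tags worth keeping; adjust as needed.
    private static final Set<String> KEEP =
            new HashSet<String>(Arrays.asList("b", "strong", "i", "em", "u", "a"));

    // Matches an opening or closing tag and captures the tag name.
    private static final Pattern TAG =
            Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    // Removes every tag whose name is not in KEEP; text content is left as-is.
    static String keepInlineFormatting(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            String replacement = KEEP.contains(name) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div><p>Some <b>bold</b> text with a "
                + "<a href=\"http://example.com\">link</a>.</p></div>";
        System.out.println(keepInlineFormatting(html));
    }
}
```

The filtered string could then be wrapped in `<p>` tags in place of the plain block text in the loop above.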

What exactly do you mean by "fail in fetching images sometimes"? Can you please add some code example for this and explain what's not working. — Friederike, Jul 24 '13 at 10:44
