I could extract the title and content (paragraphed) from the web pages on my Android application, but fail in fetching images sometimes.
However, I could not find a way to retain its html format parameters (e.g. bold, with a hyperlink, underline, or font size, etc..) in the extractor.
That is, if a sentence in the web page is equipped with bold, a hyperlink, or underline, how could I extract BOTH the sentence itself and its format parameters ?
I tried this page: An article both by the Web-API and APIs in local jar.
I would like to get the same result using local APIs as what Web-API did.
Could anyone share your experiences to this issue?
Much thanks,
James
Edit #1
Here are the codes:
signalUpdate(STATE.Start);
//
htmlDoc = HTMLFetcher.fetch(new URL(url));
//
doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
extraction.setTitle(doc.getTitle()); // obtaining title
ArticleExtractor.INSTANCE.process(doc); // obtaining content
SplitParagraphBlocksFilter.INSTANCE.process(doc);
contentBuilder.setLength(0);
for(TextBlock block : doc.getTextBlocks()) {
blockString = "<p>" + block.getText() + "</p>";
contentBuilder.append(blockString);
}
extraction.setContent(contentBuilder.toString());
// obtaining image
extractor = CommonExtractors.ARTICLE_EXTRACTOR;
ie = ImageExtractor.INSTANCE;
imgUrls = ie.process(new URL(url), extractor);
extraction.setImgUrls(imgUrls);
//
signalUpdate(STATE.Complete);
Actually, what I mean by "fail" is:
I could fetch images from some web sites. However, I could not get image in this article mentioned above.