In my Android application I can extract the title and the content (split into paragraphs) from web pages, but fetching images sometimes fails.

However, I could not find a way to retain the HTML formatting (e.g. bold, hyperlinks, underline, or font size) in the extractor output.

That is, if a sentence in the web page is bold, underlined, or contains a hyperlink, how can I extract both the sentence itself and its formatting?

I tried this page, "An article", with both the Web API and the APIs in the local jar.

I would like to get the same result from the local APIs as I get from the Web API.

Could anyone share their experience with this issue?

Many thanks,

James


Edit #1

Here is the code:

signalUpdate(STATE.Start);

// fetch the raw HTML of the page
htmlDoc = HTMLFetcher.fetch(new URL(url));

// parse the HTML into a boilerpipe TextDocument
doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
extraction.setTitle(doc.getTitle());        // obtaining title

ArticleExtractor.INSTANCE.process(doc);     // obtaining content
SplitParagraphBlocksFilter.INSTANCE.process(doc);

contentBuilder.setLength(0);

for(TextBlock block : doc.getTextBlocks()) {
    blockString = "<p>" + block.getText() + "</p>";
    contentBuilder.append(blockString);
}

extraction.setContent(contentBuilder.toString());

// obtaining image
extractor = CommonExtractors.ARTICLE_EXTRACTOR;
ie = ImageExtractor.INSTANCE;
imgUrls = ie.process(new URL(url), extractor);
extraction.setImgUrls(imgUrls);

// signal that the extraction has finished
signalUpdate(STATE.Complete);
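
For reference, the paragraph-building loop above can be reduced to the following minimal sketch. It uses plain strings in place of TextBlock, and the names ParagraphHtml, escapeHtml, and toParagraphs are hypothetical helpers, not part of boilerpipe. Since block.getText() returns plain text, the sketch escapes that text before wrapping it in <p> tags. It also shows that only plain text reaches the output on this path, so inline formatting such as bold or hyperlinks is not carried through.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of boilerpipe.
class ParagraphHtml {

    // Escape the characters that are significant in HTML text content.
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap each non-empty block of plain text in its own <p> element.
    static String toParagraphs(List<String> blockTexts) {
        StringBuilder content = new StringBuilder();
        for (String text : blockTexts) {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blocks that contain only whitespace
            }
            content.append("<p>").append(escapeHtml(trimmed)).append("</p>");
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for the strings returned by block.getText().
        List<String> blocks = Arrays.asList("Tom & Jerry", "  ", "if a < b");
        System.out.println(toParagraphs(blocks));
    }
}
```

In the real code, each string would come from block.getText() inside the loop over doc.getTextBlocks().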

To clarify what I mean by "fail":

I can fetch images from some web sites, but I could not get any image from the article mentioned above.

jct