I'm using the following code to extract the textual content from web pages. My app is hosted on Google App Engine and works just like the BoilerPipe Web API. The problem is that I can only get the result in plain-text format. I experimented with the library looking for a workaround, but I couldn't find a method that returns the result as HTML. What I want is to offer an option like HTML (extract mode), as in the original BoilerPipe Web API.

This is the code I'm using for extracting the plain text.

 PrintWriter out = response.getWriter();
    try {
        String urlString = request.getParameter("url");
        String listOutput = request.getParameter("OutputType");
        String listExtractor = request.getParameter("ExtractorType");
        URL url = new URL(urlString);
        switch (listExtractor) {
            case "1":
                String mainArticle = ArticleExtractor.INSTANCE.getText(url);
                out.println(mainArticle);
                break;
            case "2":
                String fullArticle = KeepEverythingExtractor.INSTANCE.getText(url);
                out.println(fullArticle);
                break;
        }
    } catch (BoilerpipeProcessingException e) {
        out.println("Sorry We Couldn't Scrape the URL you Entered " + e.getLocalizedMessage());
    } catch (IOException e) {
        out.println("Exception thrown");
    }

How can I add a feature for displaying the result as HTML?

BalusC
ashif-ismail

1 Answer

I am using the Boilerpipe source code, and I solved your problem with the following code:

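import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;
import de.l3s.boilerpipe.sax.HTMLHighlighter;

String urlString = "your url";
URL url = new URL(urlString);

// Fetch the page once; the same document feeds both parsing and highlighting.
final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

final BoilerpipeExtractor extractor = CommonExtractors.DEFAULT_EXTRACTOR;

// An "extracting" highlighter outputs only the content blocks, as HTML.
final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
hh.setOutputHighlightOnly(true);

// Parse the HTML into a TextDocument and let the extractor mark the content blocks.
final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
extractor.process(doc);

// Read the original HTML again and keep only the blocks marked as content.
final InputSource is = htmlDoc.toInputSource();
final String text = hh.process(doc, is);

System.out.println(text);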
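
To connect this to the `OutputType` parameter from the question, one option is to map that parameter to an output mode and pick the matching extractor. The following is only a rough sketch under my own assumptions: the class `OutputModeDispatch`, the enum `OutputMode`, and the parameter values `"html"` and `"text"` are hypothetical names, not part of Boilerpipe, and the two extractors are stand-in lambdas so the example runs without the library.

```java
import java.util.Locale;
import java.util.function.Function;

/** Hypothetical helper: chooses an output mode from the "OutputType" request parameter. */
final class OutputModeDispatch {

    /** Output modes the servlet could offer; the names are assumptions. */
    enum OutputMode { TEXT, HTML }

    /** Maps the raw parameter to a mode, falling back to plain text when absent or unknown. */
    static OutputMode parse(String raw) {
        if (raw == null) {
            return OutputMode.TEXT;
        }
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "html":
                return OutputMode.HTML;
            default:
                return OutputMode.TEXT;
        }
    }

    /** Content type to set on the response before writing the result. */
    static String contentType(OutputMode mode) {
        return mode == OutputMode.HTML
                ? "text/html; charset=UTF-8"
                : "text/plain; charset=UTF-8";
    }

    /** Applies whichever extractor matches the mode; both are supplied by the caller. */
    static String render(OutputMode mode, String url,
                         Function<String, String> textExtractor,
                         Function<String, String> htmlExtractor) {
        return (mode == OutputMode.HTML ? htmlExtractor : textExtractor).apply(url);
    }

    public static void main(String[] args) {
        // Stand-in extractors; in the servlet these would wrap the Boilerpipe calls.
        Function<String, String> text = u -> "plain text of " + u;
        Function<String, String> html = u -> "<p>html of " + u + "</p>";

        OutputMode mode = parse("HTML");
        System.out.println(contentType(mode));
        System.out.println(render(mode, "http://example.com", text, html));
    }
}
```

In the servlet, the text lambda would wrap `ArticleExtractor.INSTANCE.getText(url)` and the HTML lambda would wrap the highlighter code above. Both Boilerpipe calls throw checked exceptions, so real lambdas would need to catch and rethrow them, and you would call `response.setContentType(...)` with the chosen content type before writing the output.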
