Questions tagged [boilerpipe]

The boilerpipe library for Java provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

77 questions
2
votes
2 answers

Install attempt of Boilerpipe-py3 gives 404 error

Boilerpipe is a great Java program for cleaning web pages and I've used it in the past. I note today that many users are not able to install the Python wrapper version and get 404 and other errors. Here is one of my attempts which I copied from…
2
votes
0 answers

Python Multiprocessing Process isn't getting killed even after the task is done

I have written a Python script that reads from an Amazon SQS queue and creates as many parallel processes as the user wants. It inherits from Django's BaseCommand, and this is the code: def handle(self, *args, **kwargs): self.set_up(*args, **kwargs) …
2
votes
2 answers

Something like boilerpipe for python3

I need a general tool to extract content from HTML documents. For Python 2, boilerpipe is usually recommended. Is there a similar alternative for Python 3?
Dejwi
  • 4,393
  • 12
  • 45
  • 74
2
votes
0 answers

How to install the jpype1 library (from boilerpipe) for Python 3.3?

I am trying to install the Python library boilerpipe with pip install boilerpipe, but I am getting the error: command 'gcc' failed with exit status 1. What can I do about this error?
Prush
  • 517
  • 1
  • 6
  • 21
2
votes
1 answer

JVM crashes while implementing Python-Boilerpipe in Flask app

I'm writing a Flask app using boilerpipe to extract content. Initially I wrote the boilerpipe extraction as a script to extract website content, but when I try to integrate it with my API, the JVM crashes when executing the boilerpipe extractor. This is the error I…
shiva
  • 434
  • 4
  • 21
2
votes
1 answer

I'm trying to use the boilerpipe library for article extraction in Java

package com.index; import java.net.URL; import com.opensymphony.xwork2.ActionSupport; import de.l3s.boilerpipe.extractors.ArticleExtractor; public class search_article extends ActionSupport { /** * */ private static final long serialVersionUID…
2
votes
0 answers

How to retain the original HTML format when extracting content from web pages with boilerpipe?

I can extract the title and content (as paragraphs) from web pages in my Android application, but fetching images sometimes fails. However, I could not find a way to retain the HTML formatting (e.g. bold, with a hyperlink, underline, or…
jct
  • 21
  • 3
2
votes
0 answers

Trouble running the boilerpipe library in Python

I have tried to use the boilerpipe library in Python to extract text from pages for a college project. I wrote some simple code to do the extraction: from boilerpipe.extract import Extractor def Article(url): extractor =…
2
votes
1 answer

The HtmlHighlighter of boilerpipe in .NET does not always return the text

I am using Boilerpipe in my application, and when I try to extract the content using ArticleExtractor I get plain text only; all the HTML formatting has been removed, so I am trying HtmlHighlighter. But the process method of HtmlHighlighter…
user1685989
  • 33
  • 1
  • 5
1
vote
0 answers

Boilerpipe Server Error when trying to extract url content

I'm trying to use boilerpipe to extract the content from a given URL. When I try the demo UI it returns a server error. The same error is returned when making the call to the API. Does anyone have the same problem? Can it be related to…
pterodactella
  • 121
  • 12
1
vote
1 answer

How to get the main content of an article from HTML using boilerpipe?

I am trying to get the main content of an article from an HTML page using boilerpipe. I downloaded the latest jars from here. I am trying to use the following code: String article = ""; try { article = ArticleExtractor.INSTANCE.getText(url); …
Pritam Banerjee
  • 17,953
  • 10
  • 93
  • 108
1
vote
0 answers

Boilerpipe API getting unimportant text

Currently, I'm attempting to use the boilerpipe API to extract text from news articles. However, it doesn't fully work. For example, see this link. Even though boilerpipe gets all of the main text, it also gets some of the unimportant text…
Tdonut
  • 153
  • 5
1
vote
1 answer

How to get result of BoilerPipe extraction in HTML instead of plain text

I'm using the following code to extract the textual content from web pages. My app is hosted on Google App Engine and works exactly like the BoilerPipe Web API. The problem is that I can only get the result in plain text format. I played around the…
ashif-ismail
  • 1,037
  • 17
  • 34
1
vote
1 answer

Java web crawler downloads too many GB of data

I have written a web crawler, but when crawling it downloads too many GB of data. I want to read only the text (avoiding images, etc.). I use Boilerpipe to extract the content from HTML. Here is how I find the final redirected URL: public String…
Buntu Linux
  • 492
  • 9
  • 19
1
vote
0 answers

Install Node.js boilerpipe module on Windows?

I am trying to install the Node.js module for boilerpipe on Windows 7, but npm fails with the following error. I have installed node-gyp as well. gyp WARN install got an error, rolling back install gyp ERR! configure error gyp ERR! stack Error:…
user3373581
  • 1,141
  • 2
  • 7
  • 11