Questions tagged [boilerpipe]

The boilerpipe library for Java provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

77 questions
1
vote
0 answers

What is the best regular expression, or other simple way, to extract article content from a webpage's HTML or PHP source?

Many scripts extract articles from HTML pages. If using a regular expression to get only the main article from an HTML or PHP page source, what is the best regular expression to use? Also, what is the simplest and the…
john3825
  • 11
  • 2
1
vote
0 answers

Boilerpipe used in Android Causes Error: Conversion to Dalvik format failed

I put 'boilerpipe-1.2.0-android.jar' (https://code.google.com/p/boilerpipe/issues/detail?id=57), 'nekohtml-1.9.13.jar', and 'xerces-2.9.1.jar' into the libs folder of my Android project, but this caused a "Conversion to Dalvik format failed" error. So, I did all…
1
vote
1 answer

Using boilerpipe on Android application

I'm trying to use boilerpipe in an Android application. I have included the libraries boilerpipe-1.2.0, nekohtml-1.9.13, and xerces-2.9.1 in the libs folder. When running the application with Eclipse I get the following error: Conversion to Dalvik…
Enry_h2o
  • 31
  • 1
  • 1
1
vote
1 answer

How can I resolve "No JAVA_HOME Environment Variable set. Trying to guess it..."?

I am trying to install the Python library boilerpipe with pip install boilerpipe, but I get the error "No JAVA_HOME Environment Variable set. Trying to guess it...", even though I have already set the Java path. What can I do about this?
Prush
  • 517
  • 1
  • 6
  • 21
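
When an installer reports that JAVA_HOME is missing even though it seems to be set, a first step is usually to check what the process environment actually contains. The following is a minimal sketch of such a check, not part of boilerpipe or its Python wrapper; the class name JavaHomeCheck and the method describe are hypothetical names chosen for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class JavaHomeCheck {

    // Returns a short human-readable diagnosis for a JAVA_HOME value.
    // (Hypothetical helper; the messages are illustrative only.)
    static String describe(String javaHome) {
        if (javaHome == null || javaHome.trim().isEmpty()) {
            return "JAVA_HOME is not set";
        }
        Path dir = Paths.get(javaHome.trim());
        if (!Files.isDirectory(dir)) {
            return "JAVA_HOME does not point to a directory: " + javaHome;
        }
        return "JAVA_HOME points to an existing directory: " + javaHome;
    }

    public static void main(String[] args) {
        // Reads the variable as seen by this process, which may differ
        // from what was configured in another shell or user profile.
        System.out.println(describe(System.getenv("JAVA_HOME")));
    }
}
```

Running this from the same shell that runs pip shows whether the variable is visible there. A variable set in a different terminal session or user profile would not be.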
1
vote
0 answers

Custom output from boilerpipe - Transform <p> into two newlines

I'm using boilerpipe to extract text from websites: ArticleSentencesExtractor.getInstance().getText(inputHTMLStream). I don't see any customization possibilities. I would like to separate <p>sentence</p> elements with two newlines. Is that possible -…
LukeSolar
  • 3,795
  • 4
  • 32
  • 39
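
One way to get blank-line separation is to post-process the extracted pieces rather than the final string. The sketch below assumes the paragraph texts are already available as a list of strings; how they are obtained from boilerpipe is left open here, and the class name ParagraphJoiner and method joinParagraphs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphJoiner {

    // Trims each paragraph, drops null or empty ones, and joins the
    // remainder with exactly two newlines between paragraphs.
    static String joinParagraphs(List<String> paragraphs) {
        List<String> kept = new ArrayList<>();
        for (String paragraph : paragraphs) {
            if (paragraph == null) {
                continue;
            }
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        return String.join("\n\n", kept);
    }

    public static void main(String[] args) {
        // Illustrative input only; real input would come from the extractor.
        List<String> sample = Arrays.asList("First sentence. ", "", "Second sentence.");
        System.out.println(joinParagraphs(sample));
    }
}
```

Joining a filtered list avoids stray blank lines that a plain search-and-replace on single newlines could introduce.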
1
vote
2 answers

Using boilerpipe in Android

Boilerpipe is a library that basically extracts the main content from a webpage. For news websites, it is especially hard to extract the content as the formatting differs from site to site. So I've tried to integrate the boilerpipe library -…
Auge
  • 65
  • 1
  • 4
  • 11
1
vote
0 answers

Determining type of a string

I'm looking for some way to determine the type of a string from any article website such as this one. Types would be title, author, date, article itself. I use BeautifulSoup and Boilerpipe to scrape the relevant content: from boilerpipe.extract…
Paul Chen
  • 95
  • 1
  • 1
  • 6
1
vote
2 answers

How can I get HTML output from NBoilerPipe?

NBoilerPipe is a Mono port of the BoilerPipe Java library. I've managed to get this working in .NET 4 without too much trouble (a few library references needed fixing/etc). However, searching through the code, I cannot find any 'hooks' for HTML…
winwaed
  • 7,645
  • 6
  • 36
  • 81
1
vote
1 answer

Extract HTML article text with inline CSS

I want to extract text from crawled html web pages. I am using the excellent open source Boilerpipe library to do just that. However, with Boilerpipe I am getting only the raw text. In addition to the raw text, I need to capture the text with…
cosmos
  • 2,414
  • 4
  • 23
  • 25
0
votes
2 answers

Trouble installing Boilerpipe

This is the third time I've installed it. I had it working on Windows, and up until a few days ago on Linux. I've done all I can do and I don't understand how to run this Java program. The source code is a folder with a lib, src, some jars and a…
user723220
  • 817
  • 3
  • 12
  • 20
0
votes
1 answer

Unsupported browser agent when crawling TripAdvisor with boilerpipe

I'm programming a generic web crawler that gets the main content from a given webpage (it has to crawl different pages). I've tried to achieve this with different tools, among them: HtmlUnit: returned too much scrap when crawling. Essence: failed…
0
votes
0 answers

Exception when getting HTML from URL

I'm trying to get HTML from a URL so I can strip it down using Boilerpipe. However, I keep on getting an exception. I am using the NewsAPI to get my URLs. Here is the relevant code snippet: foreach (var article in articlesResponse.Articles) { …
esb5415
  • 25
  • 5
0
votes
0 answers

Tomcat Application throws java.lang.ClassNotFoundException for Jar in WEB-INF/lib

I'm trying to add Boilerpipe to do web scraping with my Tomcat project, but when I do so I tend to run into a problem. I add the jar as well as the necessary resources (nekohtml-1.9.13.jar and xerces-2.9.1.jar) to my WEB-INF/lib folder and as an…
bluesquare
  • 76
  • 1
  • 8
0
votes
1 answer

What is the issue with this attempt to install boilerpipe3 for Python?

There are three venues (PCs or servers) where I wish to install boilerpipe3 for Python. Each venue is running Windows 10 and Python 3, with almost the same environment set up. I have managed to install boilerpipe3 (via pip install) in two…
agftrading
  • 784
  • 3
  • 8
  • 21
0
votes
2 answers

Why can't I pip install a Python 3 package?

I am new to Python (3), using 64-bit Windows 10. When trying to install a package, I get the long error message pasted below. What should I do? (base) C:\Users\xxx>pip install boilerpipe-py3 Collecting boilerpipe-py3 Using cached…
user1774127