Questions tagged [webharvest]

Web-Harvest is an open-source web data extraction tool written in Java.

It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation, such as XSLT, XQuery, and regular expressions. Web-Harvest mainly focuses on HTML/XML-based websites, which still make up the vast majority of web content. It can also be easily supplemented with custom Java libraries to augment its extraction capabilities.

71 questions
1 vote, 1 answer

How to webscrape share counts in R

I am trying to download the share count from the left SumoMe plugin of this website: http://www.r-bloggers.com/erum-2016-first-european-conference-for-the-programming-language-r/ I am trying to use R code based on the rvest package: > library(rvest) Loading…
Marcin
1 vote, 1 answer

Webharvest crawler script not creating XML file

I'm hoping someone can point out my (probably stupid) problem with this script. I'm trying to crawl a website to get the posts on the site and to load this into an XML document. I have tried to combine a couple of example scripts - the crawler and…
zag2010
1 vote, 2 answers

Disabling XML validation in WebHarvest

I have a mobile application already published in Apple's App Store. This SPI client app uses a REST API on the server side to retrieve real-time information about bus arrivals at a specific bus stop. The app was working like a charm for 6…
1 vote, 1 answer

Webharvest if/else and try/catch always succeeding

I'm working on a project where I need to harvest some data from a website, so I'm using webharvest. I'm running into a problem where the data I'm harvesting (comments from news websites) is sometimes spread across more than one page. I'm trying to configure…
Jangari
1 vote, 1 answer

Scrape data from website using webharvest

I am trying to scrape all HTML pages from the website "http://www.tecomdirectory.com/" using webharvest, but the script fails to grab all of them and scrapes only a few. I am using the following script: