Questions tagged [webharvest]

Web-Harvest is an open-source web data extraction tool written in Java.

It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation, such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of Web content. It can also be easily supplemented with custom Java libraries to augment its extraction capabilities.

71 questions
0
votes
2 answers

Using regex in web harvest xml

I'm using Web-Harvest to scrape an e-commerce site. I'm iterating over the search page and getting each product's details in the output XML. But now I want to use a regular expression on the anchor (a) tag while scraping to get a particular string, i.e., let…
Ashwini
  • 41
  • 2
  • 10
0
votes
1 answer

Web harvest -- remove unusual characters

I'm trying to scrape a page that has some spaces after the anchors:   |   I can't seem to find a way to specify the text, and I either trigger a processor error or fail to detect the string itself. Everything AFTER the …
user991945
  • 103
  • 1
  • 3
  • 14
0
votes
1 answer

webharvest implementation in eclipse

I have an XML config (ScreenScraper) that does what I want correctly in the executable version of WebHarvest. I am confused about how to execute it through Java.
stacktraceyo
  • 1,235
  • 4
  • 16
  • 22
0
votes
1 answer

Using the right Web Scraper

I need to make a web scraper that takes an input address from the client and then retrieves data for that address from a specific site. I downloaded Webharvest; is that the right thing to begin with to learn how to write the program to do it? Also,…
stacktraceyo
  • 1,235
  • 4
  • 16
  • 22
0
votes
1 answer

Web Scraping with Web-Harvest

I am trying to write a web scraper using the Web-Harvest library to get params from realtor.com. Are there any good tutorials on how to do it? I am using the Eclipse IDE.
stacktraceyo
  • 1,235
  • 4
  • 16
  • 22
0
votes
1 answer

Scraping content of webpage using Web-harvest

I want to scrape particular content from webpages, and I am using Web-Harvest for this. It works well when I scrape other websites, but it does not scrape the content at this URL. My Java code is here: import…
kailash gaur
  • 1,407
  • 3
  • 15
  • 28
0
votes
1 answer

What is wrong with my web harvest authentication config?

I have recently started using Web-Harvest as a web scraping tool. Currently, I am at the beginning of a project where I want to authenticate / log in to a web site. Before I begin, I want to make clear that [URL] in the code replaces the…
johansson.lc
  • 322
  • 2
  • 12
-1
votes
1 answer

learning Data harvesting

I want to build a website that will harvest data from: * Facebook statuses of my friends * other websites. Unfortunately, I don't know how to harvest data. Can someone recommend a book/tutorial? How should I approach this field?
Elad Benda
  • 35,076
  • 87
  • 265
  • 471
-1
votes
1 answer

HtmlUnit scraping google+ page javascript. Click show more button not working

I am trying to scrape this page: https://plus.google.com/115016587855962294424/about. Everything works fine, but when I try to click "show more" to load more reviews, nothing happens. Here is my code: final WebClient webClient = new…
tariq.freeman
  • 176
  • 2
  • 6
-1
votes
1 answer

Limiting list returned by xpath

I am using XPath in WebHarvest and can retrieve a large list of data; however, I only need the first 5 strings returned.
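
One common way to cap an XPath result is a positional predicate on the parenthesised expression, for example `(//item)[position() <= 5]`. The sketch below shows the idea with the JDK's built-in `javax.xml.xpath` API rather than Web-Harvest itself; the class name `XPathLimit`, the helper `firstN`, and the sample `<item>` elements are hypothetical, and whether the same expression can be dropped unchanged into a Web-Harvest config is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathLimit {

    // Hypothetical helper: returns the text of at most `limit` nodes
    // matched by `path`, in document order.
    static List<String> firstN(String xml, String path, int limit) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The parentheses make position() count over the whole result set
        // instead of restarting under each parent element.
        String limited = "(" + path + ")[position() <= " + limit + "]";
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    limited,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + limited, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data with seven matching elements.
        String xml = "<list><item>a</item><item>b</item><item>c</item>"
                + "<item>d</item><item>e</item><item>f</item><item>g</item></list>";
        System.out.println(firstN(xml, "//item", 5));
    }
}
```

Without the parentheses, `//item[position() <= 5]` would keep up to five items under every parent, which can return more than five nodes overall.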
-2
votes
2 answers

How can I add the next list to do a sum and average?

I got the following code: def arrayOfInts = [actual_speed_mobile_1.toInteger(), actual_speed_mobile_2.toInteger(), actual_speed_mobile_3.toInteger(), actual_speed_mobile_4.toInteger(), actual_speed_mobile_5.toInteger()] "this is…
iollo
  • 47
  • 8