Questions tagged [webharvest]

Web-Harvest is an open-source web data extraction tool written in Java.

It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of web content. It can also be easily supplemented with custom Java libraries to augment its extraction capabilities.

71 questions
0
votes
1 answer

defining array variable in web-harvest

I'm using Web-Harvest to extract some data from a site. The site takes a POST variable named Code and returns data according to it. The available codes are gathered from another page of that site. How can I define an array-like variable to store those data…
Ariyan
  • 14,760
  • 31
  • 112
  • 175
0
votes
1 answer

Is it possible to retrieve a single page from PDF document via GET request?

I need to migrate a digital repository to a new platform, but lack access to the old platform so I have resorted to retrieving the objects over the web. Some objects contain other objects. For most objects of this type, identifying/retrieving the…
Kyle Banerjee
  • 2,554
  • 4
  • 22
  • 30
0
votes
2 answers

Harvesting hebrew names from a group of websites

I have the following website (Hebrew): http://www.daydeals.co.il/ It contains many links to external websites. I want to write a jQuery script that will 1) open all the links 2) collect the elements from all the open websites that contain the text…
Elad Benda
  • 35,076
  • 87
  • 265
  • 471
0
votes
2 answers

character (0) after scraping webpage in read_html

I'm trying to scrape "1,335,000" from the screenshot below (the number is at the bottom of the screenshot). I wrote the following code in R. t2<-read_html("https://fortune.com/company/amazon-com/fortune500/") employee_number <- t2 %>% …
Xian Zhao
  • 81
  • 1
  • 11
0
votes
1 answer

Gson.toJson() method returns string embedded with "data"

I am working with WorkFusion, an automation tool that combines Selenium, Java and Web-Harvest technology. Below is a snippet of my code.
Nitin Rathod
  • 159
  • 2
  • 15
0
votes
1 answer

Rvest returns zero list

I want to download all links/titles of papers from the web using rvest. I used the following script, but the list it returns is empty. Any suggestions? library(rvest) 1. Download the HTML and turn it into an XML file with read_html() Papers <-…
Ehsan
  • 11
  • 3
0
votes
1 answer

Creating a regex with special characters in Web Harvest

I am using Web-Harvest (http://web-harvest.sourceforge.net/), the open-source web scraping tool. The regex I am trying to use has "<" and ">" characters (because I am trying to strip out all HTML tags that come in). This causes a problem because the…
kburns
  • 782
  • 2
  • 8
  • 22
0
votes
2 answers

Angular 4 how to request a web page content as a json object

I am trying to request a web page with an HTTP call and harvest the data. I could avoid the cross-origin restriction with a Chrome plug-in, but when I make the request, the response is always "null". How could I fetch an HTML page inside my Angular app as…
tlq
  • 887
  • 4
  • 10
  • 21
0
votes
1 answer

Xquery error in WebHarvest

I'm using WebHarvest to parse some HTML. I get the following error in WebHarvest's IDE on the function that follows, and I don't understand what's wrong. I'm trying to create a function that trims a string. Error: Error executing XQuery…
cdarwin
  • 4,141
  • 9
  • 42
  • 66
0
votes
1 answer

Getting response headers with Java, encoding issue

I am using Webharvest to download a file from a website and take its original name. The Java code that I am working with is: import org.apache.commons.httpclient.Header; import org.apache.commons.httpclient.HttpClient; import…
linderman
  • 149
  • 1
  • 9
0
votes
0 answers

XSLT error when getting variable

I am trying to assign variables in an XSL file and use them later in that file. Here is how I assign them:
linderman
  • 149
  • 1
  • 9
0
votes
2 answers

Scraping a google search page for the top 10 search links for a keyword

I want to scrape the top 10 search links from a Google results page for a keyword. I am using Web-Harvest. I am planning to scrape the href links and filter out the top 10 using some attribute pattern. Is this the right way? It's not working at the moment.…
sanre6
  • 797
  • 2
  • 11
  • 28
0
votes
1 answer

How to replace string after extraction from WebHarvest?

I want to insert the records I extracted from a website into a database, but the extracted text contains an apostrophe, which causes a syntax error during SQL insertion. How can I replace the apostrophe with "’" instead in…
Jazz
  • 21
  • 3
0
votes
1 answer

How to get large pictures in Google image

I want to collect pictures from Google image search. However, I am constantly notified with an error. For example, the URL https://www.google.com/search?q=banana&hl=en&gws_rd=ssl&tbm=isch is fine in my browser, but in web harvest it reports…
0
votes
1 answer

What is the difference between the exitExecution() and stopExecution() in Webharvest Scraper class

I want to know the difference between scraper.exitExecution(), scraper.stopExecution() and scraper.finishExecutingProcessor(). I have tried looking in the Javadoc but could not find anything there. There seems to be no…
codeMan
  • 5,730
  • 3
  • 27
  • 51