Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced systems of web scraping, namely with regards to magnitude, scheduling, and automation, are often referred to as spiders, or web crawlers.
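As a minimal sketch of the "patterns in the HTML markup" approach, the Python example below pulls prices out of a small page using only the standard library. The markup, the `price` class name, and the helper names are hypothetical stand-ins; a real scraper would first download the page and would need selectors matching that site's actual structure.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a fetched page; a real scraper would
# first download the HTML (for example with urllib.request).
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        # Only text found inside a price span is recorded.
        if self.in_price:
            self.prices.append(float(data))


def extract_prices(html):
    """Return all prices found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


print(extract_prices(SAMPLE_HTML))
```

Libraries such as beautifulsoup or scrapy offer more convenient selectors, but the principle is the same: locate a repeating pattern in the markup and read the data out of it.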

Potential uses include:

Retrieving product or stock prices for comparison,
Contact scraping and collecting email addresses,
Site mashup or building an alternative front-end for an existing site,
Collection of real-estate pricing or auto sales statistics,
Website change detection
Building archives of dead pages

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intentions, to circumvent a site's purchasing system or subscription fees, or for other fraudulent purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence by educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping that operates across multiple sites, indexing information on the web using a bot or "spider". It is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
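Honoring robots.txt exclusions can be sketched with Python's standard `urllib.robotparser` module. The rules, the `example-bot` user agent, and the URLs below are made up for illustration; a real crawler would download the file from the target site rather than use an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this from
# the site's /robots.txt path instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/data.html"))
```

Checking each URL before requesting it keeps the crawler within the exclusions the site has published.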

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
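The transformation from HTML into a structured form might look like the following Python sketch, which turns an HTML table into CSV text that a spreadsheet or database import could consume. The table contents and the `TableParser` and `table_to_csv` names are hypothetical, and the parser assumes a simple table without nesting.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical table standing in for scraped markup.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>198500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the cells of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current_row = None
        self.current_cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag in ("td", "th") and self.current_row is not None:
            self.current_cell = []

    def handle_data(self, data):
        # Text is kept only while a cell is open.
        if self.current_cell is not None:
            self.current_cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.current_cell is not None:
            self.current_row.append("".join(self.current_cell).strip())
            self.current_cell = None
        elif tag == "tr" and self.current_row is not None:
            self.rows.append(self.current_row)
            self.current_row = None


def table_to_csv(html):
    """Convert the rows of an HTML table into CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


print(table_to_csv(SAMPLE_TABLE))
```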

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.

web-scraping is most often tagged along with:

➡ python (including beautifulsoup, scrapy and selenium)
➡ javascript (including node.js and phantomjs)
➡ r (including rvest)
➡ selenium
➡ xml (including xpath)
➡ java (including jsoup)
➡ php
➡ vba (including vba-excel)

A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.

Further Reading:

Wikipedia: Web scraping
Overview of the types of web scraping, as well as techniques, software, legalities and prevention.
GitHub: Guide to Preventing Web Scraping
Detailed advice on prevention of web scraping. (Original article on Stack Overflow)
HartleyBrody: I Don't Need No Stinking API
A commercial blogger's views, advice and tips for scraping.
Stack Overflow: Scraping data from website using VBA
Discussion and examples of getting started scraping with VBA.
SitePoint: Web Scraping for Beginners
Theory and examples for beginners in web scraping.

49536 questions

votes

1 answer

Puppeteer - Async function in evaluate method throws error

I am trying to check if og:image source exists. If I want to call async method in evaluate function, I get Error: Evaluation failed: [object Object] error. Error: Evaluation failed: [object Object] at ExecutionContext._evaluateInternal…

javascript node.js web-scraping puppeteer

asked Sep 12 '19 at 14:12

Matt

votes

2 answers

How to get href from tag which contains JavaScript using Python?

I am trying to get href from a tag using Python + Selenium, but the href is having "JavaScript" in it. So I am unable to get the target URL. I am using Python 3.7.3, selenium 3.141.0. HTML:

javascript python selenium web-scraping

asked Sep 11 '19 at 06:51

m.gibin

votes

2 answers

Single Scrapy Project vs. Multiple Projects

I have this dilemma on how to store all of my spiders. These spiders will be fed into Apache NiFi using a command line invocation and items read from stdin. I also plan to have a subset of these spiders return single item results using…

python web-scraping scrapy screen-scraping scrape

asked Sep 09 '19 at 21:19

Lijo

votes

1 answer

"PATH to JAVA not found. Please check JAVA is installed." error when initialising RSelenium

I am trying to start an RSelenium session to webscrape. However, when running this code: driver <- rsDriver(browser=c("chrome"), chromever="76.0.3809.126", port = 4444L) I get this error: Error in java_check() : PATH to JAVA not found. Please…

java r web-scraping

asked Sep 09 '19 at 10:53

natedjurus

votes

4 answers

getting a substring from each element of a list

I'm trying to create a list of filter facets. I've loaded all the in to a list with bs4 and now need to grab a specific substring out of the larger string that is the . I want to load each filter facet name in to a list to end up with a…

python html python-3.x web-scraping beautifulsoup

asked Sep 06 '19 at 18:07

LvP

votes

1 answer

Click on HTMLelement if condition is satisfied

I'm wondering how I can click on an HTML element through VBA if another condition is satisfied. To make it clear, I will show a short example: I need to analyze data in a specific quarter (let's say I need Q2) and for each quarter…

html vba web-scraping

asked Aug 24 '19 at 10:44

Zakiirim

votes

1 answer

Unable to let my script run through the end

I've written a script in vba using ServerXMLHTTP requests in order to be able to use proxy along with setting timeout parameter within it. When I run the script, it appears to be working but the problem is - it gets stuck after using the first…

vba web-scraping serverxmlhttp queryselector

asked Aug 19 '19 at 15:26

robots.txt

votes

1 answer

How can I bypass a cookie agreement page while web scraping using Python?

I hurt my nose to a cookie agreement page... What I am doing: import requests url = "https://stockhouse.com/community/bullboards/" r = requests.get(url) soup = BeautifulSoup(r.content, "html.parser") print(soup) which returns HTML from a cookie…

python web-scraping python-requests

asked Aug 12 '19 at 13:21

Vincent Labrecque

votes

2 answers

How to scrape tag with class like active/selected?

I'm trying to scrape a list from a website. There are two different lists, and one will load only after the first option is chosen. Issue is, I'm unable to select the first option. I scraped the list of all available options. But after writing it, I…

python selenium web-scraping

asked Aug 07 '19 at 06:24

howdy Angel

votes

2 answers

Can't scrape the links of different companies from a website using requests

I'm trying to get the links of different companies from a webpage but the script I've tried with throws the error below. In chrome dev tools I could see that I can get the ids of different companies using post http requests. However, if I can get…

python python-3.x web-scraping

asked Jul 28 '19 at 14:52

MITHU

votes

1 answer

POST request with httr package using R

I would like to get the output from POST request using httr from following site: http://www.e-grunt.ba You can see submit form when you click "ZK Ulošci". There I would like to send POST request and get the output. For example, you can select…

r post web-scraping httr

asked Jul 21 '19 at 18:58

Mislav

votes

2 answers

Web scraping with python how to get to the text

I'm trying to get the text from a website but can't find a way to do it. How do I need to write it? link="https://www.ynet.co.il/articles/0,7340,L-5553905,00.html" response = requests.get(link) soup = BeautifulSoup(response.text,'html.parser') info…

python python-3.x web-scraping python-requests

asked Jul 20 '19 at 07:13

Michael

votes

2 answers

Download CSV file from results page with options from dropdown menu

I am a novice at web scraping with R and I am stuck on this problem: I want to use R to submit a search query to PubMed, then download a CSV file from the results page. The CSV file can be accessed by clicking 'Send to', which opens a dropdown menu,…

html r web-scraping rvest httr

asked Jul 12 '19 at 17:03

kstew

votes

1 answer

How to close newly constructed tab using selenium, chrome driver and python

I am trying to scrape data from a website, there is an url which lands me a particular page, there we have links of some items, if I click on those links, it opens in a new tab, and I can extract data from there, But after extracting the data, I…

python selenium selenium-webdriver web-scraping selenium-chromedriver

asked Jul 12 '19 at 10:45

Kallol

votes

2 answers

How can I obtain amino acid sequence from this URL?

I want to obtain the amino acid sequence from the URL below using Python and Selenium, but couldn't succeed. http://flybase.org/download/sequence/FBgn0003719/FBpp I've tried using Beautiful Soup and Selenium. from selenium import webdriver driver =…

python web-scraping

asked Jul 08 '19 at 01:56

hinatafly
