Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly in terms of scale, scheduling, and automation, are often referred to as spiders or web crawlers.
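
As a rough illustration of extracting data from patterns in HTML markup, the sketch below uses only Python's standard library. The sample markup, the `price` class name, and the `PriceParser` and `extract_prices` names are assumptions made up for this example; a real page would need its own selectors, and a dedicated parsing library is often more convenient.

```python
from html.parser import HTMLParser

# Hypothetical sample page. A real scraper would download this markup first,
# for example with urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data.strip()))


def extract_prices(html):
    """Return all prices found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


prices = extract_prices(SAMPLE_HTML)
```

The parser relies on the assumption that every price sits in its own `span` with exactly the class `price`; if the site changes its markup, the pattern has to be updated.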

Potential uses include:

Retrieving product or stock prices for comparison
Contact scraping and collecting email addresses
Site mashups or building an alternative front-end for an existing site
Collection of real-estate pricing or auto sales statistics
Website change detection
Building archives of dead pages
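
Website change detection, one of the uses listed above, can be sketched by comparing fingerprints of page content between visits. This is only one possible approach; the `fingerprint` and `has_changed` helpers are hypothetical names, and the whitespace normalization is an assumption about what should count as "unchanged".

```python
import hashlib


def fingerprint(html):
    """Return a SHA-256 hex digest of the page with whitespace collapsed."""
    # Collapse runs of whitespace so purely cosmetic reformatting
    # is not reported as a change.
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, current_html):
    """Compare a stored fingerprint against freshly fetched markup."""
    return fingerprint(current_html) != previous_fingerprint


# The stored fingerprint would normally come from an earlier crawl.
old = fingerprint("<p>Price: 10</p>")
same_page = has_changed(old, "<p>Price:   10</p>\n")
new_price = has_changed(old, "<p>Price: 12</p>")
```

Pages with rotating ads or timestamps would need the volatile parts stripped before hashing, otherwise every visit looks like a change.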

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intentions, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malign purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News, and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider". It is a technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
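
A crawler can check such exclusion requests before fetching a URL. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent, and the example.com URLs are made up for illustration, and a real crawler would load the file from the site instead (for instance with `set_url()` and `read()`).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

# Ask whether our (hypothetical) bot may fetch each URL.
allowed = rules.can_fetch("example-bot", "https://example.com/products/1")
blocked = rules.can_fetch("example-bot", "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests, if specified.
delay = rules.crawl_delay("example-bot")
```

Honoring robots.txt is a convention rather than an enforcement mechanism, so the check has to be built into the crawler itself.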

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
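
To make the idea of turning HTML into a structured form concrete, here is a minimal sketch that converts an HTML table into a list of records and then into CSV text suitable for a spreadsheet. The sample table and the `TableParser`, `table_to_records`, and `records_to_csv` names are assumptions for this example, and it only handles simple tables without nested tables or spanning cells.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>198500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects table cells into a list of rows (each a list of strings)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Return one dict per body row, keyed by the header row's cells."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def records_to_csv(records):
    """Serialize the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = table_to_records(SAMPLE_TABLE)
csv_text = records_to_csv(records)
```

All values stay as strings here; converting prices to numbers or loading the records into a database would be a separate step.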

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.

web-scraping is most often tagged along with:

➡ python (including beautifulsoup, scrapy and selenium)
➡ javascript (including node.js and phantomjs)
➡ r (including rvest)
➡ selenium
➡ xml (including xpath)
➡ java (including jsoup)
➡ php
➡ vba (including vba-excel)

A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.

Further Reading:

Wikipedia: Web scraping
Overview of the types of web scraping, as well as techniques, software, legalities and prevention.
GitHub: Guide to Preventing Web Scraping
Detailed advice on prevention of web scraping. (Original article on Stack Overflow)
HartleyBrody: I Don't Need No Stinking API
A commercial blogger's views, advice and tips for scraping.
Stack Overflow: Scraping data from website using VBA
Discussion and examples of getting started scraping with VBA.
SitePoint: Web Scraping for Beginners
Theory and examples for beginners to web scraping.

49536 questions

8 answers

Puppeteer - Protocol error (Page.navigate): Target closed

As you can see with the sample code below, I'm using Puppeteer with a cluster of workers in Node to run multiple requests of websites screenshots by a given URL: const cluster = require('cluster'); const express = require('express'); const…

node.js web-scraping puppeteer google-chrome-headless node-cluster

asked Aug 01 '18 at 08:54

LioRz

10 answers

How to "scan" a website (or page) for info, and bring it into my program?

Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java). For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the…

java html web-scraping jsoup

asked May 14 '10 at 15:48

James

10 answers

How do you scrape AJAX pages?

Please advise how to scrape AJAX pages.

ajax web-scraping

asked Nov 04 '08 at 01:25

xxxxxxx

3 answers

Scraping a JSON response with Scrapy

How do you use Scrapy to scrape web requests that return JSON? For example, the JSON would look like this: { "firstName": "John", "lastName": "Smith", "age": 25, "address": { "streetAddress": "21 2nd Street", "city":…

python json web-scraping scrapy

asked Aug 11 '13 at 12:20

Thomas Kingaroy

6 answers

How to give delay between each requests in scrapy?

I don't want to crawl simultaneously and get blocked. I would like to send one request per second.

python web-scraping scrapy

asked Jan 07 '12 at 08:44

nizam.sp

5 answers

How can I get the CSS Selector in Chrome?

I want to be able to select/highlight an element on the page and find its selector like this: div.firstRow div.priceAvail>div>div.PriceCompare>div.BodyS I know you can see the selection on the bottom after doing an inspect element, but how can I…

google-chrome web-scraping

asked Dec 21 '10 at 14:59

kale

4 answers

Web Scraping With Haskell

What is the current state of libraries for scraping websites with Haskell? I'm trying to make myself do more of my quick oneoff tasks in Haskell, in order to help increase my comfort level with the language. In Python, I tend to use the excellent…

haskell html-parsing web-scraping

asked Jan 29 '11 at 17:02

ricree

11 answers

Java HTML Parsing

I'm working on an app which scrapes data from a website and I was wondering how I should go about getting the data. Specifically I need data contained in a number of div tags which use a specific CSS class - Currently (for testing purposes) I'm…

java html parsing web-scraping

asked Oct 26 '08 at 13:57

Richard Walton

9 answers

How to get the scrapy failure URLs?

I'm a newbie to Scrapy, and it's the most amazing crawler framework I have known! In my project, I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I can see some statistics but no details.…

python web-scraping report scrapy

asked Dec 05 '12 at 13:49

Joe Wu

2 answers

Puppeteer Execution context was destroyed, most likely because of a navigation

I am facing this problem in Puppeteer in a for loop: when I go to another page to get data and then go back, I get this error: Error "We have an error Error: the execution context was destroyed, probably because of a navigation." It's…

javascript node.js web-scraping puppeteer

asked Apr 27 '19 at 04:32

Salah Eddine Bentayeb

11 answers

Fetch all href link using selenium in python

I am practicing Selenium in Python and I wanted to fetch all the links on a web page using Selenium. For example, I want all the links in the href= property of all the tags on http://psychoticelites.com/ I've written a script and it is working.…

python selenium selenium-webdriver web-scraping

asked Jan 13 '16 at 06:26

Xonshiz

8 answers

How to find tag with particular text with Beautiful Soup?

How to find text I am looking for in the following HTML (line breaks marked with \n)? ... \n "Some text:"\n
\n some value\n \n "Fixed text:"\n …

python html web-scraping beautifulsoup

asked Jan 25 '12 at 17:57

LA_

4 answers

Scraping dynamic content using python-Scrapy

Disclaimer: I've seen numerous other similar posts on StackOverflow and tried to do it the same way, but they don't seem to work on this website. I'm using Python-Scrapy for getting data from koovs.com. However, I'm not able to get the product…

python web-scraping scrapy

asked May 20 '15 at 09:27

Pravesh Jain

4 answers

How to scroll down with Phantomjs to load dynamic content

I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried doing different things with Phantomjs but am not able to gather links beyond the first page. Let's say the…

javascript dom web-scraping screen-scraping phantomjs

asked May 15 '13 at 09:36

Puneet Saini

8 answers

Page content is loaded with JavaScript and Jsoup doesn't see it

One block on the page is filled with content by JavaScript, and after loading the page with Jsoup there is none of that information. Is there a way to also get JavaScript-generated content when parsing a page with Jsoup? Can't paste page code here, since…

java html web-scraping jsoup

asked Sep 20 '11 at 17:01

Eugene
