Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly in terms of scale, scheduling, and automation, are often referred to as spiders or web crawlers.
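As a rough illustration of extracting data by relying on patterns in the markup, the sketch below uses only Python's standard library. The sample HTML, the class name "price" and the helper names PriceParser and extract_prices are assumptions made up for this example; a real scraper would first download the page with an HTTP client.

```python
from html.parser import HTMLParser

# Hypothetical page content (an assumption for this sketch); a real scraper
# would first download it, for example with urllib.request.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # The repeated class attribute is the "pattern" being exploited.
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(float(data.strip()))


def extract_prices(html):
    """Return all prices found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


print(extract_prices(SAMPLE_HTML))
```

Third-party libraries such as beautifulsoup offer a more convenient interface for the same idea; the standard-library parser is used here only to keep the sketch self-contained.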

Potential uses include:

Retrieving product or stock prices for comparison,
Contact scraping and collecting email addresses,
Site mashups or building an alternative front-end for an existing site,
Collection of real-estate pricing or auto sales statistics,
Website change detection,
Building archives of dead pages.

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal in itself, legal issues can arise if it is done with malicious or plagiaristic intent, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malicious purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence by educating yourself on applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling, a component of web scraping across multiple sites, indexes information on the web using a bot or "spider". It is a universal technique adopted by most search engines, which typically honor exclusion requests such as those published in a robots.txt file placed on the site.
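A minimal sketch of how a crawler might honor such exclusion requests, using Python's standard urllib.robotparser module. The robots.txt content, the user agent "ExampleBot", the example.com URLs and the helper name is_allowed are assumptions for illustration; a real crawler would fetch the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch); a real
# crawler would download it from the site's /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/public/index.html"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data.html"))
```

Checking each URL before requesting it keeps the crawler within the limits the site has published.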

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
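The transformation from HTML into a structured form can be sketched as follows, again with only the standard library. The sample table, its values and the names TableParser and table_to_csv are assumptions invented for this example; the output is CSV text that a spreadsheet or database import could consume.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical table markup (an assumption for this sketch).
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>198500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the cell text of each table row into a list of lists."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Convert the rows of an HTML table into CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


print(table_to_csv(SAMPLE_TABLE))
```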

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.

web-scraping is most often tagged along with:

➡ python (including beautifulsoup, scrapy and selenium)
➡ javascript (including node.js and phantomjs)
➡ r (including rvest)
➡ selenium
➡ xml (including xpath)
➡ java (including jsoup)
➡ php
➡ vba (including vba-excel)

A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.

Further Reading:

Wikipedia: Web scraping
Overview of the types of web scraping, as well as techniques, software, legalities and prevention.
GitHub: Guide to Preventing Web Scraping
Detailed advice on prevention of web scraping. (Original article on Stack Overflow)
HartleyBrody: I Don't Need No Stinking API
A commercial blogger's views, advice and tips for scraping.
Stack Overflow: Scraping data from website using VBA
Discussion and examples of getting started scraping with VBA.
SitePoint: Web Scraping for Beginners
Theory and examples for beginners to web scraping.

49536 questions


6 answers

What should I use to open a url instead of urlopen in urllib3

I wanted to write a piece of code like the following: from bs4 import BeautifulSoup import urllib2 url = 'http://www.thefamouspeople.com/singers.php' html = urllib2.urlopen(url) soup = BeautifulSoup(html) But I found that I have to install urllib3…

python web-scraping beautifulsoup urllib3

asked Apr 09 '16 at 11:33 by niloofar (reputation 2,244; badges 5/23/44)

5 answers

Python - make a POST request using Python 3 urllib

I am trying to make a POST request to the following page: http://search.cpsa.ca/PhysicianSearch In order to simulate clicking the 'Search' button without filling out any of the form, which adds data to the page. I got the POST header information by…

python http post web-scraping urllib

asked Apr 07 '16 at 18:17 by Daniel Paczuski Bak (reputation 3,720; badges 8/32/78)

4 answers

Click a Button in Scrapy

I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click on a certain button (of course also appears in the HTML code after clicking). I found out that Scrapy can handle forms (like logins) as shown here. But…

python web-crawler web-scraping scrapy

asked Jul 13 '11 at 16:45 by naeg (reputation 3,944; badges 3/24/29)

9 answers

How to print an exception in Python 3?

Right now, I catch the exception in the except Exception: clause, and do print(exception). The result provides no information since it always prints <class 'Exception'>. I knew this used to work in Python 2, but how do I do it in Python 3?

python python-3.x exception web-scraping

asked Jan 11 '17 at 17:13 by Haonan Chen

5 answers

Get meta tag content property with BeautifulSoup and Python

I am trying to use python and beautiful soup to extract the content part of the tags below: I'm…

python html web-scraping beautifulsoup

asked Apr 21 '16 at 11:22 by the_t_test_1 (reputation 1,193; badges 1/12/28)

6 answers

How to manage a 'pool' of PhantomJS instances

I'm planning a webservice for my own use internally that takes one argument, a URL, and returns html representing the resolved DOM from that URL. By resolved I mean that the webservice will firstly get the page at that URL, then use PhantomJS to…

node.js web-scraping phantomjs jsdom

asked Apr 01 '12 at 01:41 by Trindaz (reputation 17,029; badges 21/82/111)

5 answers

Scrape An Entire Website

I'm looking for recommendations for a program to scrape and download an entire corporate website. The site is powered by a CMS that has stopped working and getting it fixed is expensive and we are able to redevelop the website. So I would like to…

html web-scraping

asked Feb 13 '12 at 17:38 by Dale Fraser (reputation 4,623; badges 7/39/76)

8 answers

Scrape web pages in real time with Node.js

What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several different sites, the results scraped, and returned…

javascript jquery node.js screen-scraping web-scraping

asked Mar 06 '11 at 15:47 by Avishai (reputation 4,512; badges 4/41/67)

4 answers

Using BeautifulSoup to extract text without tags

My webpage looks like this:

YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
…

python web-scraping beautifulsoup

asked Apr 30 '14 at 05:15 by myloginid (reputation 1,463; badges 2/22/37)

10 answers

Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments. What's a generic way…

python web-scraping html-parsing html

asked Jan 12 '11 at 17:46 by kefeizhou (reputation 6,234; badges 10/42/55)

6 answers

Change IP address dynamically?

Consider this case: I want to crawl websites frequently, but my IP address gets blocked after some days or after reaching a limit. So how can I change my IP address dynamically, or are there any other ideas?

web-scraping ip web-crawler scrapy dynamic-ip

asked Mar 04 '15 at 10:27 by Magendran V (reputation 1,411; badges 3/19/33)

4 answers

csv.writer writing each character of word in separate column/cell

Objective: To extract the text from the anchor tag inside all lines in models and put it in a csv. I'm trying this code: with open('Sprint_data.csv', 'ab') as csvfile: spamwriter = csv.writer(csvfile) models = soup.find_all('li' , {"class" :…

python csv web-scraping

asked Feb 28 '13 at 07:08 by vivekanon (reputation 1,813; badges 3/22/44)

6 answers

Save and render a webpage with PhantomJS and node.js

I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page. This should be a simple example with an obvious use-case for PhantomJS. I can't find a…

javascript html node.js web-scraping phantomjs

asked Apr 01 '12 at 18:01 by Harry (reputation 52,711; badges 71/177/261)

4 answers

Python: Disable images in Selenium Google ChromeDriver

I spend a lot of time searching about this. At the end of the day I combined a number of answers and it works. I share my answer and I'll appreciate it if anyone edits it or provides us with an easier way to do this. 1- The answer in Disable images…

python google-chrome selenium web-scraping web-crawler

asked Jan 21 '15 at 15:01 by 1man (reputation 5,216; badges 7/42/56)

5 answers

How can I download a file on a click event using selenium?

I am working with Python and Selenium. I want to download a file from a click event using Selenium. I wrote the following code. from selenium import webdriver from selenium.common.exceptions import NoSuchElementException from…

python selenium selenium-webdriver web-scraping

asked Aug 26 '13 at 08:32 by sam (reputation 18,509; badges 24/83/116)
