Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly with regard to scale, scheduling, and automation, are often referred to as spiders or web crawlers.
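As a minimal sketch of the "patterns in the HTML markup" idea, the following uses only Python's standard-library `html.parser`. The sample markup, the `price` class name, and the names `PriceParser` and `extract_prices` are invented for illustration; a real scraper would first fetch the page over HTTP and would usually rely on a more forgiving parser.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a price span.
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$4.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$12.50</span></li>
</ul>
"""


def extract_prices(html):
    """Return the text of all price spans in document order."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


print(extract_prices(SAMPLE_HTML))
```

The extraction depends entirely on the repeated `<span class="price">` pattern, so it breaks if the site changes its markup. That fragility is typical of pattern-based scraping.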

Potential uses include:

  • Retrieving product or stock prices for comparison,

  • Contact scraping and collecting email addresses,

  • Site mashup or building an alternative front-end for an existing site,

  • Collection of real-estate pricing or auto sales statistics,

  • Website change detection

  • Building archives of dead pages

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights for some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intentions, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or ill-intended purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider". It is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
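A crawler can check such exclusion requests before fetching a URL. The sketch below uses Python's standard-library `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, the example.com URLs, and the helper name `is_allowed` are all assumptions made for illustration. The file is parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site's /robots.txt path instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/data.html"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` would replace the `parse()` call.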

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically HTML, into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
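To illustrate that transformation step, here is a rough sketch that turns an HTML table into CSV text suitable for a spreadsheet, using only the standard library. The table contents and the names `TableParser` and `html_table_to_csv` are invented for the example. It assumes a simple table with no nesting and no `rowspan` or `colspan` attributes.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Parse the table rows in html and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Hypothetical markup standing in for a scraped page fragment.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>198500</td></tr>
</table>
"""

print(html_table_to_csv(SAMPLE_TABLE))
```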

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.



A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.



49536 questions
73
votes
6 answers

What should I use to open a url instead of urlopen in urllib3

I wanted to write a piece of code like the following: from bs4 import BeautifulSoup import urllib2 url = 'http://www.thefamouspeople.com/singers.php' html = urllib2.urlopen(url) soup = BeautifulSoup(html) But I found that I have to install urllib3…
niloofar
  • 2,244
  • 5
  • 23
  • 44
71
votes
5 answers

Python - make a POST request using Python 3 urllib

I am trying to make a POST request to the following page: http://search.cpsa.ca/PhysicianSearch In order to simulate clicking the 'Search' button without filling out any of the form, which adds data to the page. I got the POST header information by…
Daniel Paczuski Bak
  • 3,720
  • 8
  • 32
  • 78
69
votes
4 answers

Click a Button in Scrapy

I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click on a certain button (of course also appears in the HTML code after clicking). I found out that Scrapy can handle forms (like logins) as shown here. But…
naeg
  • 3,944
  • 3
  • 24
  • 29
69
votes
9 answers

How to print an exception in Python 3?

Right now, I catch the exception in the except Exception: clause, and do print(exception). The result provides no information since it always prints . I knew this used to work in python 2, but how do I do it in python3?
Haonan Chen
  • 890
  • 1
  • 6
  • 11
67
votes
5 answers

Get meta tag content property with BeautifulSoup and Python

I am trying to use python and beautiful soup to extract the content part of the tags below: I'm…
the_t_test_1
  • 1,193
  • 1
  • 12
  • 28
66
votes
6 answers

How to manage a 'pool' of PhantomJS instances

I'm planning a webservice for my own use internally that takes one argument, a URL, and returns html representing the resolved DOM from that URL. By resolved I mean that the webservice will firstly get the page at that URL, then use PhantomJS to…
Trindaz
  • 17,029
  • 21
  • 82
  • 111
66
votes
5 answers

Scrape An Entire Website

I'm looking for recommendations for a program to scrape and download an entire corporate website. The site is powered by a CMS that has stopped working and getting it fixed is expensive and we are able to redevelop the website. So I would like to…
Dale Fraser
  • 4,623
  • 7
  • 39
  • 76
66
votes
8 answers

Scrape web pages in real time with Node.js

What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several different sites, the results scraped, and returned…
Avishai
  • 4,512
  • 4
  • 41
  • 67
66
votes
4 answers

Using BeautifulSoup to extract text without tags

My webpage looks like this:

YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''

myloginid
  • 1,463
  • 2
  • 22
  • 37
64
votes
10 answers

Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments. What's a generic way…
kefeizhou
  • 6,234
  • 10
  • 42
  • 55
64
votes
6 answers

Change IP address dynamically?

Consider this case: I want to crawl websites frequently, but my IP address gets blocked after some days or after some limit is reached. How can I change my IP address dynamically, or are there any other ideas?
Magendran V
  • 1,411
  • 3
  • 19
  • 33
64
votes
4 answers

csv.writer writing each character of word in separate column/cell

Objective: To extract the text from the anchor tag inside all lines in models and put it in a csv. I'm trying this code: with open('Sprint_data.csv', 'ab') as csvfile: spamwriter = csv.writer(csvfile) models = soup.find_all('li' , {"class" :…
vivekanon
  • 1,813
  • 3
  • 22
  • 44
62
votes
6 answers

Save and render a webpage with PhantomJS and node.js

I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page. This should be a simple example with an obvious use-case for PhantomJS. I can't find a…
Harry
  • 52,711
  • 71
  • 177
  • 261
62
votes
4 answers

Python: Disable images in Selenium Google ChromeDriver

I spend a lot of time searching about this. At the end of the day I combined a number of answers and it works. I share my answer and I'll appreciate it if anyone edits it or provides us with an easier way to do this. 1- The answer in Disable images…
1man
  • 5,216
  • 7
  • 42
  • 56
61
votes
5 answers

How can I download a file on a click event using selenium?

I am working on python and selenium. I want to download file from clicking event using selenium. I wrote following code. from selenium import webdriver from selenium.common.exceptions import NoSuchElementException from…
sam
  • 18,509
  • 24
  • 83
  • 116