Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced systems of web scraping, namely with regards to magnitude, scheduling, and automation, are often referred to as spiders, or web crawlers.
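As a minimal sketch of the pattern-based extraction described above, the example below uses only Python's standard-library `html.parser`. The sample markup, the `price` class name, and the `PriceParser` and `extract_prices` names are assumptions made for illustration; a real scraper would first fetch the HTML over HTTP.

```python
from html.parser import HTMLParser

# Hypothetical sample page; in practice this string would come from an HTTP response.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # The "pattern" being exploited: prices live in span.price elements.
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())


def extract_prices(html):
    """Return the price strings found in the given HTML."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


print(extract_prices(SAMPLE_HTML))
```

This approach only sees markup present in the HTML it is given; content generated by JavaScript after page load would need the embedded-browser approach instead.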

Potential uses include:

  • Retrieving product or stock prices for comparison,

  • Contact scraping and collecting email addresses,

  • Site mashup or building an alternative front-end for an existing site,

  • Collection of real-estate pricing or auto sales statistics,

  • Website change detection

  • Building archives of dead pages
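One of the uses listed above, website change detection, can be sketched by comparing fingerprints of two snapshots of a page. This is only one possible approach; the `fingerprint` and `has_changed` names are hypothetical, and collapsing whitespace before hashing is an assumption about what should count as "unchanged".

```python
import hashlib


def fingerprint(html: str) -> str:
    """Hash the page after collapsing runs of whitespace (an assumed normalization)."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(old_html: str, new_html: str) -> bool:
    """Report whether two snapshots of a page differ beyond whitespace."""
    return fingerprint(old_html) != fingerprint(new_html)
```

Storing only the fingerprint of the previous snapshot is enough to detect a change, although it cannot say what changed.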

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights for some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intentions, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malicious purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is the component of web scraping that works across multiple sites: a bot or "spider" indexes information on the web. Most search engines use this technique, while honoring exclusion requests such as those published in a robots.txt file placed on the site.
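Honoring a robots.txt exclusion request can be sketched with Python's standard-library `urllib.robotparser`. The robots.txt content, the `my-bot` user agent, the example.com URLs, and the `is_allowed` helper are assumptions for illustration; a real crawler would download the file from the target site rather than use an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/public/page"))
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/private/page"))
```

A crawler would typically run a check like this before each request and skip any URL that is disallowed.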

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
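The transformation from HTML into a structured form can be illustrated with a small sketch that turns an HTML table into CSV text ready for a spreadsheet. It uses only the standard library; the sample table, its values, and the `TableParser` and `table_to_csv` names are made up for this example, and nested tables are not handled.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample markup standing in for a scraped page.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>198500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects table cells into a list of rows (no nested-table support)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html: str) -> str:
    """Convert the table rows found in html into CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


print(table_to_csv(SAMPLE_TABLE))
```

The same list of rows could equally be inserted into a database instead of being written as CSV.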

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.


A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.


49536 questions
60
votes
8 answers

Puppeteer - Protocol error (Page.navigate): Target closed

As you can see with the sample code below, I'm using Puppeteer with a cluster of workers in Node to run multiple requests of websites screenshots by a given URL: const cluster = require('cluster'); const express = require('express'); const…
57
votes
10 answers

How to "scan" a website (or page) for info, and bring it into my program?

Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java). For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the…
James
  • 5,622
  • 9
  • 34
  • 42
57
votes
10 answers

How do you scrape AJAX pages?

Please advise how to scrape AJAX pages.
xxxxxxx
  • 5,037
  • 6
  • 28
  • 26
55
votes
3 answers

Scraping a JSON response with Scrapy

How do you use Scrapy to scrape web requests that return JSON? For example, the JSON would look like this: { "firstName": "John", "lastName": "Smith", "age": 25, "address": { "streetAddress": "21 2nd Street", "city":…
Thomas Kingaroy
  • 575
  • 1
  • 5
  • 7
54
votes
6 answers

How to give delay between each requests in scrapy?

I don't want to crawl simultaneously and get blocked. I would like to send one request per second.
nizam.sp
  • 4,002
  • 5
  • 39
  • 63
54
votes
5 answers

How can I get the CSS Selector in Chrome?

I want to be able to select/highlight an element on the page and find its selector like this: div.firstRow div.priceAvail>div>div.PriceCompare>div.BodyS I know you can see the selection on the bottom after doing an inspect element, but how can I…
kale
  • 1,161
  • 1
  • 9
  • 16
53
votes
4 answers

Web Scraping With Haskell

What is the current state of libraries for scraping websites with Haskell? I'm trying to make myself do more of my quick oneoff tasks in Haskell, in order to help increase my comfort level with the language. In Python, I tend to use the excellent…
ricree
  • 35,626
  • 13
  • 36
  • 27
52
votes
11 answers

Java HTML Parsing

I'm working on an app which scrapes data from a website and I was wondering how I should go about getting the data. Specifically I need data contained in a number of div tags which use a specific CSS class - Currently (for testing purposes) I'm…
Richard Walton
  • 4,789
  • 3
  • 38
  • 49
52
votes
9 answers

How to get the scrapy failure URLs?

I'm new to Scrapy and it's the most amazing crawler framework I have known! In my project, I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I can only see some statistics but no details.…
Joe Wu
  • 727
  • 1
  • 8
  • 14
51
votes
2 answers

Puppeteer Execution context was destroyed, most likely because of a navigation

I am facing this problem in puppeteer in a for loop when i go on another page to get data, then when i go back it comes me this error line: Error "We have an error Error: the execution context was destroyed, probably because of a navigation." It's…
49
votes
11 answers

Fetch all href link using selenium in python

I am practicing Selenium in Python and I wanted to fetch all the links on a web page using Selenium. For example, I want all the links in the href= property of all the tags on http://psychoticelites.com/ I've written a script and it is working.…
Xonshiz
  • 1,307
  • 2
  • 20
  • 48
48
votes
8 answers

How to find tag with particular text with Beautiful Soup?

How to find text I am looking for in the following HTML (line breaks marked with \n)? ... \n "Some text:"\n
\n some value\n \n "Fixed text:"\n …
LA_
  • 19,823
  • 58
  • 172
  • 308
47
votes
4 answers

Scraping dynamic content using python-Scrapy

Disclaimer: I've seen numerous other similar posts on StackOverflow and tried to do it the same way, but they don't seem to work on this website. I'm using Python-Scrapy for getting data from koovs.com. However, I'm not able to get the product…
Pravesh Jain
  • 4,128
  • 6
  • 28
  • 47
47
votes
4 answers

How to scroll down with Phantomjs to load dynamic content

I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried doing different things with Phantomjs but am not able to gather links beyond the first page. Let's say the…
Puneet Saini
  • 597
  • 1
  • 7
  • 12
46
votes
8 answers

Page content is loaded with JavaScript and Jsoup doesn't see it

One block on the page is filled with content by JavaScript, and after loading the page with Jsoup there is none of that information. Is there a way to also get JavaScript-generated content when parsing a page with Jsoup? Can't paste page code here, since…
Eugene
  • 4,352
  • 8
  • 55
  • 79