Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. Larger systems that add scale, scheduling, and automation are often referred to as spiders or web crawlers.
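
As a rough illustration of extracting data from patterns in HTML markup, here is a minimal sketch using only Python's standard library. The sample markup, the "price" class name, and the names PriceParser and extract_prices are all assumptions made for this example; a real scraper would first download the page and would need to match the target site's actual structure.

```python
from html.parser import HTMLParser

# Hypothetical markup used for illustration. A real scraper would first
# download this text, e.g. urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect the text inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # Only text seen while inside a price span is kept.
        if self._in_price:
            self.prices.append(float(data.strip()))


def extract_prices(html):
    """Return the prices found in the given HTML as floats."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


print(extract_prices(SAMPLE_HTML))
```

Pages that build their content with JavaScript will not expose such data in the raw HTML, which is where the embedded-browser approach mentioned above comes in.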

Potential uses include:

Retrieving product or stock prices for comparison,
Contact scraping and collecting email addresses,
Site mashup or building an alternative front-end for an existing site,
Collection of real-estate pricing or auto sales statistics,
Website change detection,
Building archives of dead pages.

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal in itself, legal issues can arise if it is done with malicious or plagiaristic intent, to circumvent a site's purchasing system or subscription fees, or for other fraudulent purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence by educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider". It is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
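
A crawler can check such exclusion requests before fetching a page. The sketch below uses urllib.robotparser from Python's standard library; the robots.txt content, the "my-bot" user agent and the example.com URLs are hypothetical, and the helper names build_parser and is_allowed are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A live crawler would instead call
# parser.set_url("https://example.com/robots.txt") and then parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "my-bot", "https://example.com/public/index.html"))
print(is_allowed(parser, "my-bot", "https://example.com/private/data.html"))
```

Honoring robots.txt is a convention rather than an access control, so it does not replace reading the site's terms of service.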

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
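
To make the "HTML into a structured form" step concrete, here is a small sketch that turns an HTML table into a list of records and then into CSV text suitable for a spreadsheet. The sample table and the names TableParser, table_to_records and records_to_csv are assumptions for illustration; real pages usually need more defensive parsing (nested tables, missing cells, and so on).

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical table used for illustration only.
SAMPLE_TABLE = """
<table>
  <tr><th>city</th><th>median_price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>198500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Return one dict per body row, keyed by the header row's cells."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def records_to_csv(records):
    """Serialize the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = table_to_records(SAMPLE_TABLE)
print(records_to_csv(records))
```

The same list of records could just as well be inserted into a database; CSV is used here only because it needs nothing beyond the standard library.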

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.

web-scraping is most often tagged along with:

➡ python (including beautifulsoup, scrapy and selenium)
➡ javascript (including node.js and phantomjs)
➡ r (including rvest)
➡ selenium
➡ xml (including xpath)
➡ java (including jsoup)
➡ php
➡ vba (including vba-excel)

A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.

Further Reading:

Wikipedia: Web scraping
Overview of the types of web scraping, as well as techniques, software, legalities and prevention.
GitHub: Guide to Preventing Web Scraping
Detailed advice on prevention of web scraping. (Original article on Stack Overflow)
HartleyBrody: I Don't Need No Stinking API
A commercial blogger's views, advice and tips for scraping.
Stack Overflow: Scraping data from website using VBA
Discussion and examples of getting started scraping with VBA.
SitePoint: Web Scraping for Beginners
Theory and examples for beginners to web scraping.

49536 questions

151 votes, 11 answers

How to scrape only visible webpage text with BeautifulSoup?

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the…

python web-scraping text beautifulsoup html-content-extraction

asked Dec 20 '09 at 17:55

user233864

114 votes, 2 answers

What's the best way of scraping data from a website?

I need to extract contents from a website, but the application doesn’t provide any application programming interface or another mechanism to access that data programmatically. I found a useful third-party tool called Import.io that provides click…

api web-scraping screen-scraping

asked Mar 04 '14 at 10:11

0x1ad2

102 votes, 6 answers

What is the difference between web-crawling and web-scraping?

Is there a difference between Crawling and Web-scraping? If there's a difference, what's the best method to use in order to collect some web data to supply a database for later use in a customised search engine?

search-engine web-scraping web-crawler

asked Dec 01 '10 at 17:54

wassimans

101 votes, 2 answers

selenium with scrapy for dynamic page

I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this: starts with a product_list page with 10 products a click on "next" button loads the next 10 products (url doesn't change between the…

python selenium selenium-webdriver web-scraping scrapy

asked Jul 31 '13 at 16:08

Z. Lin

5 answers

How to scrape a website which requires login using python and beautifulsoup?

If I want to scrape a website that requires login with password first, how can I start scraping it with python using beautifulsoup4 library? Below is what I do for websites that do not require login. from bs4 import BeautifulSoup import urllib2…

python web-scraping beautifulsoup

asked Apr 16 '14 at 07:33

guagay_wk

7 answers

Using python Requests with javascript pages

I am trying to use the Requests framework with python (http://docs.python-requests.org/en/latest/) but the page I am trying to get to uses javascript to fetch the info that I want. I have tried to search on the web for a solution but the fact that…

python web-scraping python-requests

asked Oct 15 '14 at 22:31

biw

8 answers

How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this: http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/ http://snipplr.com/view/67006/using-scrapy-from-a-script/ I can't…

python web-scraping web-crawler scrapy

asked Nov 18 '12 at 04:09

user47954

8 answers

Extracting an information from web page by machine learning

I would like to extract a specific type of information from web pages in Python. Let's say postal address. It has thousands of forms, but still, it is somehow recognizable. As there is a large number of forms, it would be probably very difficult to…

python machine-learning html-parsing web-scraping extract

asked Nov 11 '12 at 23:27

Honza Javorek

4 answers

How to manage log in session through headless chrome?

I want to create a scraper that: opens a headless browser, goes to a url, logs in (there is steam oauth), fills some inputs, and clicks 2 buttons. My problem is that every new instance of headless browser clears my login session, and then I need…

javascript cookies web-scraping headless puppeteer

asked Feb 04 '18 at 14:15

Anton Kurtin

3 answers

Is it ok to scrape data from Google results?

I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

web-scraping

asked Mar 26 '14 at 10:07

ML_

7 answers

Selenium-Debugging: Element is not clickable at point (X,Y)

I try to scrape this site by Selenium. I want to click in "Next Page" buttom, for this I do: driver.find_element_by_class_name('pagination-r').click() it works for many pages but not for all, I got this error WebDriverException: Message: Element is…

python selenium-webdriver web-scraping selenium-firefoxdriver

asked Jun 17 '16 at 10:18

parik

18 answers

Converting html to text with Python

I am trying to convert an html block to text using Python. Input:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean…

python html web-scraping text beautifulsoup

asked Feb 04 '13 at 19:55

Aaron Bandelli

7 answers

Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)

What are the best options for performing Web Scraping of a not currently open tab from within a Google Chrome Extension with JavaScript and whatever more technologies are available. Other JavaScript-libraries are also accepted. The important thing…

javascript google-chrome google-chrome-extension xmlhttprequest web-scraping

asked Jun 28 '11 at 14:48

Seb Nilsson

10 answers

Web scraping with Java

I'm not able to find any good web scraping Java based API. The site which I need to scrape does not provide any API as well; I want to iterate over all web pages using some pageID and extract the HTML titles / other stuff in their DOM trees. Are…

java web-scraping frameworks

asked Jul 08 '10 at 09:38

NoneType

4 answers

Simple jQuery selector only selects first element in Chrome..?

I'm a bit new to jQuery so forgive me for being dense. I want to select all elements on a particular page via Chrome's JS console: $('td') Yet when I do this, I get the following output: Apples Isn't jQuery supposed to return an…

jquery google-chrome web-scraping

asked Jan 13 '13 at 21:49

fbonetti
