Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly in terms of scale, scheduling, and automation, are often referred to as spiders or web crawlers.
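As a minimal sketch of extracting data by relying on patterns in the markup, the snippet below parses a small, made-up HTML fragment with Python's standard-library html.parser. The fragment, the class names "product", "name" and "price", and the helper extract_products are all assumptions for illustration; a real page would be downloaded first and would have its own structure.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded HTML response.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field currently being read, if any
        self.products = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            # Each product list item starts a new record.
            self.products.append({})
        elif tag == "span" and cls in ("name", "price"):
            self.current_field = cls

    def handle_data(self, data):
        # Only keep text that appears inside a recognised span.
        if self.current_field and self.products:
            self.products[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def extract_products(html):
    """Return a list of dicts, one per product found in the markup."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


products = extract_products(SAMPLE_HTML)
```

Libraries such as beautifulsoup or scrapy offer more convenient selectors for the same idea; the standard-library parser is used here only to keep the sketch self-contained.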

Potential uses include:

Retrieving product or stock prices for comparison,
Contact scraping and collecting email addresses,
Site mashup or building an alternative front-end for an existing site,
Collection of real-estate pricing or auto sales statistics,
Website change detection
Building archives of dead pages

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal in itself, legal issues can arise if it is done with malicious or plagiaristic intent, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malicious purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider" and is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
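A rough sketch of honoring robots.txt exclusion requests, using Python's standard-library urllib.robotparser. The robots.txt content, the user-agent name "example-bot" and the example.com URLs are made up for illustration; a real crawler would fetch the file from the target site instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site's /robots.txt location.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# Ask whether the (hypothetical) bot may fetch each URL.
allowed = rules.can_fetch("example-bot", "https://example.com/public/page.html")
blocked = rules.can_fetch("example-bot", "https://example.com/private/data.html")
```

Checking can_fetch before each request is one simple way for a scraper to respect a site's published exclusions.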

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
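To illustrate turning HTML into a structured form, here is a small sketch that converts an HTML table into CSV text using only the Python standard library. The sample table, its values, and the helper table_to_csv are assumptions for illustration; the sketch also assumes every cell sits inside a table row.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical table standing in for markup scraped from a page.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>310000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects table cells into a list of rows (lists of strings)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False


def table_to_csv(html):
    """Return the table's rows as CSV text, ready for a spreadsheet."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


csv_text = table_to_csv(SAMPLE_TABLE)
```

The resulting text can be written to a file and opened in a spreadsheet, or the rows can be inserted into a database instead.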

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.

web-scraping is most often tagged along with:

➡ python (including beautifulsoup, scrapy and selenium)
➡ javascript (including node.js and phantomjs)
➡ r (including rvest)
➡ selenium
➡ xml (including xpath)
➡ java (including jsoup)
➡ php
➡ vba (including vba-excel)

A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.

Further Reading:

Wikipedia: Web scraping
Overview of the types of web scraping, as well as techniques, software, legalities and prevention.
GitHub: Guide to Preventing Web Scraping
Detailed advice on prevention of web scraping. (Original article on Stack Overflow)
HartleyBrody: I Don't Need No Stinking API
A commercial blogger's views, advice and tips for scraping.
Stack Overflow: Scraping data from website using VBA
Discussion and examples of getting started scraping with VBA.
SitePoint: Web Scraping for Beginners
Theory and examples for beginners to web scraping.

49536 questions

votes

1 answer

How to scrape and extract all the subcategories names from all its associated pages for a wikipedia category using python 3.6?

I want to scrape all the subcategories and pages under the category header of the Category page: "Category:Computer science". The link for the same is as follows: http://en.wikipedia.org/wiki/Category:Computer_science. I have got an idea regarding…

python python-3.x web-scraping beautifulsoup wikipedia

asked Oct 06 '18 at 17:47

M S

votes

2 answers

FileNotFoundError in 'wb' file mode in Python?

I am trying to write a program that downloads all the xkcd comic images and saves them in a directory, with each image named title.png, title being the title of the comic. Here's the code for it: #Downloads all the xkcd comics import…

python web-scraping beautifulsoup python-requests

asked Oct 05 '18 at 02:34

Udasi Tharani

votes

1 answer

How to extract the file modification time of a scraped image?

I'm trying to scrape part of a parts website that contains images of the parts, to collect some statistics. However, there is no url or image upload or creation date, so I have to use the approximate image file modification-date to get this info.…

python web-scraping scrapy

asked Sep 17 '18 at 10:18

not2qubit

votes

2 answers

Scrapy: scraping data from Pagination

So far I have scraped data from one page. I want to continue until the end of the pagination. Click Here to view the page. There seems to be a problem because the href contains a javascript element.

python xpath web-scraping scrapy

asked Sep 09 '18 at 03:34

Riwaj Chalise

votes

1 answer

Web Scraping a tableauViz into an R dataframe

I have spent a lot of time searching for an answer to this, but have not found anything yet. What I am trying to accomplish is to scrape Tableau table information that is contained in a tableauViz element and propagate it into an R dataframe. In my…

r xml web-scraping tableau-api

asked Sep 07 '18 at 16:11

UTexas80

votes

1 answer

Web scraping using selenium and bs4

I'm trying to build a dataframe by scraping this page: https://www.schoolholidayseurope.eu/choose-a-country. First, I told Selenium to click on the page of my choice, then I used XPath and tag elements to build the header and body, but…

python html web-scraping beautifulsoup selenium-chromedriver

asked Sep 07 '18 at 07:52

ALEXANDRE W.

votes

5 answers

Python 3: How to web scrape text from div that contains multiple class values

I'm trying to web scrape a website (Here is the link to website), but the div in the page seems to have multiple class attributes, which is making it hard for me to scrape the data. I tried to look for earlier questions posted on Stack Overflow, but could…

html python-3.x selenium web-scraping beautifulsoup

asked Sep 06 '18 at 01:14

DanLee

votes

2 answers

How to download file from a page using python

I am having trouble downloading a txt file from this page: https://www.ceps.cz/en/all-data#RegulationEnergy (when you scroll down and see Download: txt, xls and xml). My goal is to create a scraper that will go to the linked page, click on the txt…

python selenium web-scraping python-requests

asked Sep 04 '18 at 16:36

Loko

votes

0 answers

multiprocessing pool with a dictionary as one of the arguments?

Is it possible to use Pool.map() on a function that takes an empty dictionary as one of its arguments? I am new to multiprocessing and want to parallelise a web-scraping function. I tried following the example from this site however it doesn't…

python-3.x web-scraping multiprocessing

asked Aug 31 '18 at 04:49

Spencer Trinh

votes

2 answers

Google scraping using python - requests: How to avoid being blocked due to many requests?

For a school project I need to get the web addresses of 200 companies (based on a list). My script is working fine, but when I'm around company 80, I get blocked by Google. This is the message that I'm getting. > Our systems have detected unusual…

python python-2.7 web-scraping python-requests

asked Aug 21 '18 at 17:25

PAstudilloE

votes

1 answer

Scraping pagination with Python

I'm trying to scrape some data for airlines from the following website: http://www.airlinequality.com/airline-reviews/airasia-x[1]. I managed to get the data I need, but I am struggling with pagination on the web page. I'm trying to get all the…

python web-scraping pagination scrapy

asked Aug 20 '18 at 00:21

onr

votes

1 answer

Table element not showing in BeautifulSoup

I am trying to extract table data from this web site Following is the code-- import requests from bs4 import BeautifulSoup as bs page = requests.get('https://www.vitalityservicing.com/serviceapi/Monitoring/QueueDepth?tenantId=1') soup =…

python html web-scraping beautifulsoup python-requests

asked Aug 17 '18 at 14:39

spark

votes

2 answers

Selenium, Presence of one of many elements located?

Building off of the answer to How to wait until the page is loaded with Selenium for Python? I am attempting to create a method that allows multiple elements to be polled for presence using Expected Conditions. I receive an error 'bool' object is…

python selenium web-scraping expected-condition

asked Aug 01 '18 at 20:09

Liquidgenius

votes

3 answers

How do I check if a URL has a link on botw.org or not?

I am developing an application in which I have to check whether a link exists on botw.org for a given URL. Is there any free API available to check botw.org, or any other source to check this? thanks!

java api hyperlink web-scraping web-crawler

asked Mar 02 '11 at 05:10

Tokendra Kumar Sahu

votes

1 answer

Manually change response URL during Puppeteer request interception

I'm having a hard time navigating relative urls with puppeteer for a specific use case. Below you can see the basic setup and a pseudo example describing the problem. Essentially I want to change the current url the browser thinks it is at. What I…

javascript web-scraping puppeteer

asked Jul 30 '18 at 14:58

joe.hart
