Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced systems of web scraping, namely with regards to magnitude, scheduling, and automation, are often referred to as spiders, or web crawlers.
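As a minimal sketch of the pattern-based extraction described above, the following uses only Python's standard library. The markup in `SAMPLE_HTML`, the `price` class name, and the helper `extract_prices` are hypothetical and stand in for a page that a real scraper would first download over HTTP.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a fetched page; a real scraper would
# obtain this string from an HTTP response instead.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(float(data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


def extract_prices(html_text):
    """Return the prices found in html_text as a list of floats."""
    parser = PriceParser()
    parser.feed(html_text)
    return parser.prices


print(extract_prices(SAMPLE_HTML))
```

The extraction depends entirely on the page keeping its `class="price"` convention, which is why scrapers of this kind tend to break when a site changes its markup.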

Potential uses include:

Retrieving product or stock prices for comparison,
Contact scraping and collecting email addresses,
Site mashup or building an alternative front-end for an existing site,
Collection of real-estate pricing or auto sales statistics,
Website change detection
Building archives of dead pages

The practice of web scraping has drawn a lot of controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intentions, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malign purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider", and is a universal technique adopted by most search engines while honoring exclusion requests such as those published in a robots.txt file placed on the site.
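One way a crawler can honor such exclusion requests is with Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the rules in `ROBOTS_TXT`, the `example-bot` user agent, the example.com URLs, and the helper `is_allowed` are all assumptions, and the rules are parsed from a string rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the site's /robots.txt path before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "https://example.com/public/index.html",
    "https://example.com/private/data.html",
):
    print(url, is_allowed(ROBOTS_TXT, "example-bot", url))
```

Note that robots.txt is advisory: it tells well-behaved bots what the site owner prefers, but it does not by itself settle the legal questions discussed above.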

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
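The transformation from HTML into a structured form can be sketched as follows, again with the standard library only. The table in `SAMPLE_TABLE` (including its city names and figures) and the helper `table_to_csv` are hypothetical; the point is only to show rows of markup becoming rows of CSV that a spreadsheet or database can load.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical table standing in for scraped markup; the values are made up.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>198500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the cell text of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html_text):
    """Convert the table rows in html_text to CSV text."""
    parser = TableParser()
    parser.feed(html_text)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


print(table_to_csv(SAMPLE_TABLE))
```

Using the `csv` module rather than joining strings by hand means cells that contain commas or quotes are escaped correctly.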

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.

web-scraping is most often tagged along with:

➡ python (including beautifulsoup, scrapy and selenium)
➡ javascript (including node.js and phantomjs)
➡ r (including rvest)
➡ selenium
➡ xml (including xpath)
➡ java (including jsoup)
➡ php
➡ vba (including vba-excel)

A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.

Further Reading:

Wikipedia: Web scraping
Overview of the types of web scraping, as well as techniques, software, legalities and prevention.
GitHub: Guide to Preventing Web Scraping
Detailed advice on prevention of web scraping. (Original article on Stack Overflow)
HartleyBrody: I Don't Need No Stinking API
A commercial blogger's views, advice and tips for scraping.
Stack Overflow: Scraping data from website using VBA
Discussion and examples of getting started scraping with VBA.
SitePoint: Web Scraping for Beginners
Theory and examples for beginners to web scraping.

49536 questions

votes

2 answers

How to identify the classname or id in Python Scraping with beautifulsoup and selenium

I am building a scraper and have already been able to read the table and the information that I want. The problem is with the next page link: I have tried using a class name and also an svg tag, but the code breaks as the value of the class…

python selenium selenium-webdriver web-scraping beautifulsoup

asked Dec 05 '18 at 04:18

sputnikk1093

votes

1 answer

Python slowly scrapes websites

I've implemented a news website scraper that uses Selenium web driver to access dynamic web pages and BeautifulSoup to retrieve the content. While parsing websites, I'm also writing scraped data to MongoDB storage and downloading pictures.…

python-3.x selenium parsing web-scraping beautifulsoup

asked Dec 03 '18 at 06:03

Irina Nazarchuk

votes

0 answers

Python html-requests render() doesn't render javascript elements

I am attempting to scrape a website which as well as requiring a login, the core data is rendered with javascript and XHR files. I am using the html-requests library, however the render() function appears to have no effect on the webpage. Here is my…

python python-3.x web-scraping python-requests python-requests-html

asked Nov 29 '18 at 05:36

S. Allen

votes

2 answers

WebDriverWait on finding element by CSS Selector

I want to retrieve the price of the flight of this webpage using Python 3: https://www.google.es/flights?lite=0#flt=/m/0h3tv./m/04jpl.2018-12-17;c:EUR;e:1;a:FR;sd:1;t:f;tt:o At first I got an error which after many hours I realized was due to the…

python css python-3.x selenium web-scraping

asked Nov 28 '18 at 20:06

David García Ballester

votes

1 answer

How to scrape a web site using Python, Requests and Xpath?

I try to scrape first name + last name of people on this web page (https://www.meleenumerique.com/scientist_comite) using the code below but it doesn't work. How can I determine what's wrong with it? This is the code I wrote from lxml import html …

python web-scraping python-requests lxml

asked Nov 24 '18 at 16:01

Nico2806

votes

1 answer

Scrapy crawl spider does not download files?

So I made a crawl spider which crawls this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products), follows every link on that web page, scrapes the title from each, and is supposed to download the file as well.…

python-3.x web-scraping scrapy

asked Nov 19 '18 at 17:50

GKV

votes

2 answers

Unable to get text from parent and child nodes/tags with Scrapy

Before this is marked as a duplicate: I've searched and tried other solutions found on SO, which are: scrapy css selector: get text of all inner tags; How to get the text from child nodes if it is parents to other node in Scrapy using XPath; scrapy get…

python xpath web-scraping scrapy

asked Nov 13 '18 at 09:35

Amir Asyraf

votes

1 answer

Excel VBA HTML Nested QuerySelector

Consider this extract of an html page: Document

20 Records found.

html excel vba web-scraping css-selectors

asked Nov 12 '18 at 16:38

drec4s

votes

3 answers

How to write a csv file line by line?

I am trying to scrape data from a website and I have collected 3 different types of information from it. I have thousands of records in the 3 lists, but for simplicity I am adding only a few records. List1 = ['DealerName'] List2 =…

python web-scraping

asked Nov 01 '18 at 01:20

ShubhamA

votes

1 answer

Python Web Scraping saving Tik Tok video from url

I am trying to save videos from this url: Original: https://api2.musical.ly/aweme/v1/play/?video_id=v09044a20000beeff4c108gs7sflfdug Link changes to…

python selenium web-scraping beautifulsoup

asked Oct 23 '18 at 02:37

VickTree

votes

1 answer

Web Scraping contents of ::before ::after CSS Psuedo element using BeautifulSoup

I'm learning Web Scraping. I would like to know how can we fetch participants count from below element?

::before "255,590 Participants" ::after

Code I've tried soupy =…

css python-3.x web-scraping beautifulsoup

asked Oct 22 '18 at 19:47

Adam Iqshan

votes

2 answers

How to scrape page with BeautifulSoup? Page Source not matching Inspect Element

I'm trying to scrape a few things from this fantasy basketball page. I'm using BeautifulSoup in Python 3.5+ to do this. source_code = requests.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975') plain_text =…

python web-scraping beautifulsoup

asked Oct 20 '18 at 22:09

Warren Crasta

votes

3 answers

How to click a button on a website using Puppeteer without any class, id ,... assigned to it?

So I want to click on a button on a website. The button has no id, class,... So I should find a way to click the button with the name that's on it. In this example I should click by the name "Supreme®/The North Face® Leather Shoulder Bag" This…

html node.js web-scraping automation puppeteer

asked Oct 20 '18 at 10:43

wizencrowd

votes

1 answer

How to read wikipedia table of 2018 in film using python pandas and BeautifulSoup

I was attempting to find the movies of 2018 January to March of 2018 from wikipedia page using pandas read html. Here is my code: import pandas as pd import numpy as np link = "https://en.wikipedia.org/wiki/2018_in_film" tables =…

python pandas web-scraping beautifulsoup python-requests

asked Oct 18 '18 at 00:02

user8864088

votes

2 answers

Selenium download entire html

I have been trying to use Selenium to scrape an entire web page. I expect at least a handful of them are SPAs such as Angular, React, Vue, so that is why I am using Selenium. I need to download the entire page (if some content isn't loaded from…

python selenium dom web-scraping pageloadstrategy

asked Oct 08 '18 at 06:19

Pink
