Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly in terms of scale, scheduling, and automation, are often referred to as spiders or web crawlers.
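
As a minimal sketch of the pattern-based extraction described above, the following uses only Python's standard library to pull link targets out of an HTML string. The sample markup, the `LinkCollector` class and the `extract_links` helper are assumptions made for illustration; a real scraper would first download the page and would often use a dedicated parser such as Beautiful Soup instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, standing in for a downloaded document
SAMPLE_HTML = """
<html><body>
  <a href="/products/1">First product</a>
  <p>No link here</p>
  <a href="/products/2">Second product</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

The same approach extends to other patterns (table cells, elements with a given class) by checking different tag names and attributes in `handle_starttag`.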

Potential uses include:

Retrieving product or stock prices for comparison,
Contact scraping and collecting email addresses,
Site mashups or building an alternative front-end for an existing site,
Collection of real-estate pricing or auto sales statistics,
Website change detection,
Building archives of dead pages.
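
To illustrate one of these uses, website change detection can be sketched by comparing fingerprints of two snapshots of a page. This is only one possible approach, and the `fingerprint` and `has_changed` names are hypothetical; the snapshots here are plain strings rather than downloaded pages.

```python
import hashlib


def fingerprint(page_text):
    """Hash the page text after collapsing runs of whitespace."""
    # Normalizing whitespace avoids flagging purely cosmetic changes
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(old_text, new_text):
    """Return True if the two snapshots differ in content."""
    return fingerprint(old_text) != fingerprint(new_text)


if __name__ == "__main__":
    print(has_changed("Price: 10 EUR", "Price: 12 EUR"))
```

Storing only the fingerprint of the previous snapshot is enough to detect a change on the next visit, although it cannot say what changed.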

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intent, to circumvent a site's purchasing system or subscription fees, or for other fraudulent purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider" and is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
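
A crawler can check such exclusion requests before fetching a URL. The sketch below uses `urllib.robotparser` from the Python standard library; the robots.txt content, the `ExampleBot` user agent and the example.com URLs are made up for illustration, and a real crawler would load the file from the target site rather than from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, standing in for a file served by a site
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/public/index.html",
                "https://example.com/private/data.html"):
        print(url, is_allowed(rules, "ExampleBot", url))
```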

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically HTML, into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
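
The transformation described above can be sketched as turning an HTML table into CSV text that a spreadsheet can open. This is a minimal standard-library sketch under stated assumptions: the sample table, the `TableParser` class and the two helper functions are hypothetical, and nested tables or malformed markup are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Turns the cells of an HTML table into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_rows(html):
    """Extract table rows as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample markup
SAMPLE_TABLE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

if __name__ == "__main__":
    print(rows_to_csv(table_to_rows(SAMPLE_TABLE)))
```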

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.

web-scraping is most often tagged along with:

➡ python (including beautifulsoup, scrapy and selenium)
➡ javascript (including node.js and phantomjs)
➡ r (including rvest)
➡ selenium
➡ xml (including xpath)
➡ java (including jsoup)
➡ php
➡ vba (including vba-excel)

A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.

Further Reading:

Wikipedia: Web scraping
Overview of the types of web scraping, as well as techniques, software, legalities and prevention.
GitHub: Guide to Preventing Web Scraping
Detailed advice on prevention of web scraping. (Original article on Stack Overflow)
HartleyBrody: I Don't Need No Stinking API
A commercial blogger's views, advice and tips for scraping.
Stack Overflow: Scraping data from website using VBA
Discussion and examples of getting started scraping with VBA.
SitePoint: Web Scraping for Beginners
Theory and examples for beginners to web scraping.

49536 questions

3 answers

Is there a way to extract the displayed name of a webElement using selenium?

I'm trying to access the names of different products displayed on a website using selenium. For example on https://www.supremenewyork.com/shop/all/jackets I'm able to locate the products (webElements) and put them in a list but I can't get their name…

java selenium web-scraping

asked Jul 05 '19 at 15:52

xszn

3 answers

Scrapy - How to stop meta refresh redirect?

This is the website I am crawling. I had no problem at first, but then I encountered this error. [scrapy] DEBUG: Redirecting (meta refresh) to

python http redirect web-scraping scrapy

asked Jul 03 '19 at 09:12

gunesevitan

3 answers

Unable to let my script slide a button to the right

I've written a script in python in combination with selenium to log in to a website. The thing is my script sometimes logs in successfully, but most of the time it comes across a slider which is meant to be pressed and slid to the right. Website…

python python-3.x selenium selenium-webdriver web-scraping

asked Jun 30 '19 at 11:17

MITHU

2 answers

How to fix '$(...).click is not a function' in Node/Cheerio

I am writing an application in node.js that will navigate to a website, click a button on the website, and then extract certain pieces of data from the website. All is going well except for the button-clicking aspect. I cannot seem to simulate a…

javascript node.js web-scraping request cheerio

asked Jun 19 '19 at 20:26

CodeMonkey JD

2 answers

How to scrape inside list using puppeteer

I am looking for a way to efficiently scrape information formatted in the following way using puppeteer. Suppose I have a list of things on a website divided as such:

…

javascript html web-scraping puppeteer

asked Jun 03 '19 at 06:05

pam

2 answers

How to scrape data individually from tags using BeautifulSoup?

I'm trying to scrape data from elections.in. There are three tables with the same class. Below is the HTML from the website

17th General (Lok Sabha) Election Results 2019 – State Wise

python python-3.x web-scraping beautifulsoup

asked May 27 '19 at 12:45

Sri Sree

3 answers

How to extract data from a dropdown menu using python beautifulsoup

I am trying to scrape data from a website that has a multilevel drop-down menu; every time an item is selected, it changes the sub-items for the sub drop-downs. The problem is that on every loop it extracts the same sub-items from the drop-down items. the…

python web-scraping drop-down-menu beautifulsoup

asked May 27 '19 at 06:16

Geek Online

3 answers

Writing Scrapy Python Output to JSON file

I'm new to Python and web scraping. In this program I want to write final output (product name and price from all 3 links) to JSON file. Please help! import scrapy from time import sleep import csv, os, json import random class…

python json web-scraping scrapy append

asked May 26 '19 at 16:50

amal

3 answers

How to web-scrape multiple pages with Selenium (Python)

I've seen several solutions to scrape multiple pages from a website, but couldn't make them work in my code. At the moment, I have this code, which works to scrape the first page. And I would like to create a loop to scrape all the pages of the…

python-3.x selenium-webdriver web-scraping beautifulsoup

asked May 17 '19 at 08:02

mr-kim

1 answer

R: scraping additional data after POST only works for first page

I would like to scrape drug information offered by the Swiss government for a university research project from: http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue= The page does offer a robots.txt file,…

r web-scraping rvest

asked May 09 '19 at 22:46

captcoma

4 answers

Beautiful Soup find all values for a given attribute, without specifying the tag

Is there a way to get all values of a certain attribute? Example: ... ... ... Can I get all titles, even if they are in different…

python-3.x web-scraping beautifulsoup

asked May 09 '19 at 22:46

klaus

1 answer

Why isn't BeautifulSoup scraping the entire webpage?

Premise: I am totally new to Python and web scraping. I am trying to scrape the data about the brands on this page: https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/ , but BeautifulSoup extracts the html only up to a certain…

python web-scraping beautifulsoup

asked May 07 '19 at 10:50

BlancheT

2 answers

Scrapy 1.6 : DNS lookup failed

I am new to Scrapy and I'm trying to crawl this website https://www.timeanddate.com/weather/india and it's throwing a DNS lookup error. The code I wrote for scraping works perfectly in shell so my guess is the DNS error happens before scraping takes…

python-3.x web-scraping scrapy

asked May 02 '19 at 06:44

DarkSied

4 answers

Unable to make my script stop when some urls are scraped

I've created a script in scrapy to parse the titles of different sites listed in start_urls. The script is doing its job flawlessly. What I wish to do now is let my script stop after two of the urls are parsed no matter how many urls are…

python python-3.x web-scraping scrapy

asked Apr 22 '19 at 09:19

MITHU

0 answers

Undefined error in httr call. httr output: Recv failure: Connection was reset

I am trying to scrape this site: www.oddsportal.com. This is my code in R: library(wdman) library(RSelenium) library(rvest) library(data.table) pjs <- wdman::phantomjs(port=8912L) eCap <- list(phantomjs.page.settings.userAgent =…

r web-scraping phantomjs rvest rselenium

asked Apr 17 '19 at 16:02

Tomas
