Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly with regard to scale, scheduling, and automation, are often referred to as spiders or web crawlers.
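
As a minimal sketch of extracting data from patterns in the markup, the snippet below uses Python's standard-library html.parser. The sample HTML, the "price" class name, and the helper names are assumptions made for illustration; a real scraper would first fetch the page over HTTP.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        # Only keep text that appears inside a price span
        if self.in_price:
            self.prices.append(data.strip())


# Hypothetical page content standing in for a downloaded HTML page.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


def extract_prices(html):
    """Return the text of all price spans found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


print(extract_prices(SAMPLE_HTML))
```

Third-party parsers offer more convenient selectors, but the idea is the same: find the repeating markup pattern and pull out the text it wraps.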

Potential uses include:

  • Retrieving product or stock prices for comparison,

  • Contact scraping and collecting email addresses,

  • Site mashup or building an alternative front-end for an existing site,

  • Collection of real-estate pricing or auto sales statistics,

  • Website change detection

  • Building archives of dead pages
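
As one illustration of the uses above, website change detection can be sketched by comparing content fingerprints of two snapshots of a page. The function names and the whitespace normalization step are assumptions chosen for this example, not a standard approach.

```python
import hashlib


def fingerprint(html):
    """Return a SHA-256 hex digest of the page with whitespace runs collapsed."""
    # Collapsing whitespace keeps trivial formatting changes from counting as a change
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(old_html, new_html):
    """Report whether two snapshots of a page differ after normalization."""
    return fingerprint(old_html) != fingerprint(new_html)


# Hypothetical snapshots of the same page taken at different times.
print(has_changed("<p>Price: 10</p>", "<p>Price:   10</p>\n"))
print(has_changed("<p>Price: 10</p>", "<p>Price: 12</p>"))
```

Storing only the digest between runs is usually enough to decide whether a page needs to be scraped again.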

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal in itself, legal issues can arise if it is done with malicious or plagiaristic intent, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malicious purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider". It is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
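
A minimal sketch of honoring robots.txt exclusion requests, using Python's standard-library urllib.robotparser. The rules, the "ExampleBot" user agent, and the example.com URLs are hypothetical; a real crawler would download the site's actual robots.txt before checking URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally this is fetched from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/public/index.html"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data.html"))
```

Checking each URL this way before requesting it keeps a crawler within the limits the site has published.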

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically HTML, into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
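
The transformation from HTML into a structured form can be sketched as follows: an HTML table is parsed with the standard-library html.parser and written out as CSV text that a spreadsheet or database import can read. The sample table and all class and function names are assumptions for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table cells into a list of rows (each row is a list of strings)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Convert the rows of an HTML table into CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Hypothetical scraped markup.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
</table>
"""

print(html_table_to_csv(SAMPLE_TABLE))
```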

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.


A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.


49536 questions
4
votes
3 answers

Is there a way to extract the displayed name of a webElement using selenium?

I'm trying to access the names of different products displayed on a website using Selenium. For example, on https://www.supremenewyork.com/shop/all/jackets I'm able to locate the products (webElements) and put them in a list, but I can't get their name…
xszn
  • 127
  • 5
4
votes
3 answers

Scrapy - How to stop meta refresh redirect?

This is the website I am crawling. I had no problem at first, but then I encountered this error. [scrapy] DEBUG: Redirecting (meta refresh) to
gunesevitan
  • 882
  • 10
  • 25
4
votes
3 answers

Unable to let my script slide a button to the right

I've written a script in Python in combination with Selenium to log in to a website. The thing is, my script sometimes logs in successfully, but most of the time it comes across a slider which is meant to be pressed and slid to the right. Website…
MITHU
  • 113
  • 3
  • 12
  • 41
4
votes
2 answers

How to fix '$(...).click is not a function' in Node/Cheerio

I am writing an application in node.js that will navigate to a website, click a button on the website, and then extract certain pieces of data from the website. All is going well except for the button-clicking aspect. I cannot seem to simulate a…
CodeMonkey JD
  • 55
  • 2
  • 7
4
votes
2 answers

How to scrape inside

I am looking for a way to efficiently scrape information formatted in the following way using puppeteer. Suppose I have a list of things on a website divided as such:
pam
  • 113
  • 1
  • 10
4
votes
2 answers

how to scrape data individually from tags using beautifulSoup?

I'm trying to scrape data from elections.in. There are three tables with the same class. Below is the HTML from the website:

17th General (Lok Sabha) Election Results 2019 – State Wise

Sri Sree
  • 43
  • 5
4
votes
3 answers

How to extract data from a dropdown menu using python beautifulsoup

I am trying to scrape data from a website that has a multilevel drop-down menu. Every time an item is selected, it changes the sub-items for the sub drop-downs. The problem is that on every loop it extracts the same sub-items from the drop-down items. the…
Geek Online
  • 53
  • 1
  • 8
4
votes
3 answers

Writing Scrapy Python Output to JSON file

I'm new to Python and web scraping. In this program I want to write the final output (product name and price from all 3 links) to a JSON file. Please help! import scrapy from time import sleep import csv, os, json import random class…
amal
  • 3,470
  • 10
  • 29
  • 43
4
votes
3 answers

How to web-scrape multiple pages with Selenium (Python)

I've seen several solutions for scraping multiple pages from a website, but couldn't make them work with my code. At the moment, I have this code, which works for scraping the first page. I would like to create a loop to scrape all the pages of the…
mr-kim
  • 83
  • 1
  • 2
  • 8
4
votes
1 answer

R: scraping additional data after POST only works for first page

I would like to scrape drug information offered by the Swiss government for a university research project from: http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue= The page does offer a robots.txt file,…
captcoma
  • 1,768
  • 13
  • 29
4
votes
4 answers

Beautiful Soup find all values for a given attribute, without specifying the tag

Is there a way to get all values of a certain attribute? Example: ... ... ... Can I get all titles, even if they are in different…
klaus
  • 1,187
  • 2
  • 9
  • 19
4
votes
1 answer

Why isn't BeautifulSoup scraping the entire webpage?

Premise: I am totally new to Python and web scraping. I am trying to scrape the data about the brands on this page: https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/ , but BeautifulSoup extracts the html only up to a certain…
BlancheT
  • 41
  • 1
4
votes
2 answers

Scrapy 1.6 : DNS lookup failed

I am new to Scrapy and I'm trying to crawl this website, https://www.timeanddate.com/weather/india, and it's throwing a DNS lookup error. The code I wrote for scraping works perfectly in the shell, so my guess is the DNS error happens before scraping takes…
DarkSied
  • 49
  • 1
  • 6
4
votes
4 answers

Unable to make my script stop when some urls are scraped

I've created a script in Scrapy to parse the titles of different sites listed in start_urls. The script is doing its job flawlessly. What I wish to do now is let my script stop after two of the urls are parsed no matter how many urls are…
MITHU
  • 113
  • 3
  • 12
  • 41
4
votes
0 answers

Undefined error in httr call. httr output: Recv failure: Connection was reset

I am trying to scrape this site: www.oddsportal.com. This is my code in R: library(wdman) library(RSelenium) library(rvest) library(data.table) pjs <- wdman::phantomjs(port=8912L) eCap <- list(phantomjs.page.settings.userAgent =…
Tomas
  • 153
  • 6