Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully fledged web browser. More advanced web-scraping systems, notably with regard to scale, scheduling, and automation, are often referred to as spiders or web crawlers.
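
As a concrete illustration of that request-and-parse pattern, here is a minimal sketch in Python. It assumes the third-party requests and beautifulsoup4 packages are installed; the URL and the CSS selectors are placeholders rather than any real site's markup.

    # Minimal sketch of the request-and-parse pattern described above.
    # Assumes the third-party "requests" and "beautifulsoup4" packages;
    # the URL and the selectors below are placeholders, not a real site.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com/products", timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Exploit a repeating pattern in the markup: here, each product is
    # assumed to sit in a <div class="product"> with a <span class="price">.
    for product in soup.select("div.product"):
        name = product.select_one("h2")
        price = product.select_one("span.price")
        if name and price:
            print(name.get_text(strip=True), price.get_text(strip=True))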

Potential uses include:

  • Retrieving product or stock prices for comparison

  • Contact scraping and collecting email addresses

  • Site mashups or building an alternative front-end for an existing site

  • Collecting real-estate pricing or auto sales statistics

  • Website change detection

  • Building archives of dead pages

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal in itself, legal issues can arise if it is done with malicious or plagiaristic intent, to circumvent a site's purchasing system or subscription fees, or for other fraudulent purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence: educate yourself on applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News, and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider" and is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
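
A well-behaved crawler checks these exclusion rules before fetching a page. Below is a minimal sketch using only the Python standard library; the site URL and the user-agent string are placeholders.

    # Honouring robots.txt exclusion requests before crawling.
    # Standard library only; the URL and user agent are placeholders.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser("https://example.com/robots.txt")
    robots.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/some/page"
    if robots.can_fetch("MyCrawler/1.0", url):
        print("Allowed to fetch", url)
    else:
        print("Disallowed by robots.txt:", url)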

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
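
As an illustration of that transformation step, the sketch below parses a small inline HTML table and writes it out as a CSV file that a spreadsheet or database can import; the HTML snippet and the column names are made up for the example.

    # Turning unstructured HTML into structured (tabular) data.
    # The HTML snippet and the column names are stand-ins for a real page.
    import csv
    from bs4 import BeautifulSoup

    html = """
    <table>
      <tr><td>Widget</td><td>9.99</td></tr>
      <tr><td>Gadget</td><td>24.50</td></tr>
    </table>
    """

    soup = BeautifulSoup(html, "html.parser")
    rows = [[cell.get_text(strip=True) for cell in tr.find_all("td")]
            for tr in soup.find_all("tr")]

    with open("products.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "price"])  # header row for the structured output
        writer.writerows(rows)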

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data as in web scraping); it originally involved reading a terminal's memory or video data by connecting the terminal's output to another computer's input port.

A note on spelling

The verb is spelled to scrape (present participle scraping) and is not to be confused with to scrap or scrapping, which means to discard something you no longer want or need, or to abandon a plan.


49,536 questions

4 votes, 2 answers

How do I scrape the product price from target.com product page?

I've recently learned about web scraping and wanted to create a program that scraped daily product prices. I'm using requests and bs4 in python to scrape target.com. So far this is my code: TIMES = [2, 3, 4, 5, 6, 7] url =…

4 votes, 2 answers

How to scrape with BeautifulSoup waiting a second to save the soup element to let elements load complete in the page

I'm trying to scrape data from THIS WEBSITE, which has 3 kinds of prices on some products (muted price, red price and black price). I observed that the red price changes before the page loads when the product has 3 prices. When I scrape the website I…

4 votes, 2 answers

Scrapy runs all spiders at once. I want to only run one spider at a time. Scrapy crawl

I am new to Scrapy and am trying to play around with the framework. What is really frustrating is that when I run "scrapy crawl (name of spider)" it runs every single spider in my "spiders" folder. So I either have to wait out all of the spiders…
Tom H

4 votes, 3 answers

How to use this Datepicker with Puppeteer

I would like to crawl flight data from the following page: https://www.airprishtina.com/de/ I managed to select the airports, but this page has a Datepicker and I don't understand how to use it programmatically. With a click on the Startdate input I…
Simon Hansen

4 votes, 3 answers

RSelenium with RSDriver. Error: httr output: Failed to connect to localhost port 4445: Connection refused

I am trying to use RSelenium for webscraping. I am following the basics tutorial as explained on cran. The recommended approach is to install Docker (see tutorial as well as this stackoverflow answer). If I understand correctly, this is not an…
eigenvector

4 votes, 1 answer

How to grab an URL using IMPORTXML and Xpath in Google Sheets?

Trying to grab the URLs or URL snippets of images from a webpage using Google Sheets IMPORTXML function. I'm fairly sure I have the Xpath right, but I either get nothing or a "that data can't be parsed" - and yet I've seen other examples here of…

4 votes, 2 answers

How to save my scraped data in AWS s3 bucket

How do I integrate my scraping code with lambda_handler to save the data in an S3 bucket? I am not able to save the data. I have an AWS account (not enterprise, the account given by AWS for 2.00) and need to save the data in the S3 bucket. The bucket name is…
user6882757

4 votes, 1 answer

How do I retrieve Youtube's autocomplete results using Jsoup (Java)?

As shown in this image I want to retrieve autocomplete search results using Jsoup. I'm already retrieving the video URL, video title and thumbnail using the video id, but I am stuck at retrieving them from the search results. I have to complete…
raj kavadia

4 votes, 1 answer

Is it possible to scrape all google scholar results on a particular topic and is it legal?

I have some R experience, but not with website coding, and I think I was not able to select the correct CSS nodes to parse. library(rvest) library(xml2) library(selectr) library(stringr) library(jsonlite) url…

4 votes, 2 answers

Chromedp: handle alert

How can I catch an alert box showing on a web page and get the text inside it using chromedp? I have noticed that when the alert shows up, Page.javascriptDialogOpening is firing. I am using …
Salis

4 votes, 3 answers

Scrape data from bloomberg

I want to scrape data from the Bloomberg website. The data under "IBVC:IND Caracas Stock Exchange Stock Market Index" needs to be scraped. Here is my code so far: import requests from bs4 import BeautifulSoup as bs headers = { 'User-Agent':…
Ibtsam Ch

4 votes, 2 answers

Web scraping and looping through pages with R

I am learning data scraping and, on top of that, I am quite a debutant with R (for work I use STATA, I use R only for very specific tasks). In order to learn scraping, I am exercising with a few pages on Psychology Today. I have written a function…
Fuca26

4 votes, 2 answers

Enter query in search bar and scrape results

I have a database with ISBN numbers of different books. I gathered them using Python and Beautifulsoup. Next I would like to add categories to the books. There is a standard when it comes to book categories. A website called https://www.bol.com/nl/…

4 votes, 7 answers

BeautifulSoup - How to extract email from a website?

I'm trying to extract some information from a website, but I don't know how to scrape the email. This code works for me: from urllib.request import urlopen as uReq from bs4 import BeautifulSoup url =…
NK20

4 votes, 2 answers

WebScraping in R: extract names from `href` tags

This is my code: library(rvest) library(XML) library(xml2) url_imb <- 'https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature' web_page<-read_html(url_imb) I want to extract all Directors names related to…
Laura