Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly in terms of scale, scheduling, and automation, are often referred to as spiders or web crawlers.
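As a rough illustration of extracting data from patterns in HTML markup, here is a minimal sketch using only Python's standard library. The sample markup, the `price` class name, and the `PriceParser` and `extract_prices` names are assumptions made up for this example; in practice the HTML would come from an HTTP request (for instance via `urllib.request.urlopen`) rather than an inline string.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(float(data))


def extract_prices(html):
    """Return all prices found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


if __name__ == "__main__":
    print(extract_prices(SAMPLE_HTML))
```

The approach relies entirely on the markup pattern staying stable; if the site renames the class or restructures the page, the extraction silently returns nothing, which is why scrapers usually need monitoring.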

Potential uses include:

  • Retrieving product or stock prices for comparison,

  • Contact scraping and collecting email addresses,

  • Site mashup or building an alternative front-end for an existing site,

  • Collection of real-estate pricing or auto sales statistics,

  • Website change detection,

  • Building archives of dead pages.

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intentions, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malign purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a manner that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider" and is a technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
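To make the robots.txt point concrete, here is a small sketch of how a crawler might check exclusion rules before fetching a URL, using the standard library's `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the example.com URLs are assumptions invented for illustration; a real crawler would download the site's actual robots.txt instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        print(url, is_allowed(rules, "ExampleBot", url))
```

Note that robots.txt is advisory: it expresses the site owner's wishes but does not by itself settle the legal or terms-of-service questions discussed above.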

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically HTML, into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
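The following sketch shows one possible way to turn an HTML table into structured records and then into CSV text suitable for a spreadsheet, again using only the standard library. The table contents and the `TableParser`, `table_to_records`, and `records_to_csv` names are hypothetical, and the parser assumes a simple table whose first row is the header.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical table standing in for scraped markup.
TABLE_HTML = """
<table>
  <tr><th>city</th><th>price</th></tr>
  <tr><td>Austin</td><td>350000</td></tr>
  <tr><td>Denver</td><td>420000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def table_to_records(html):
    """Return one dict per body row, keyed by the header row's cells."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def records_to_csv(records):
    """Serialize the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(records_to_csv(table_to_records(TABLE_HTML)))
```

Cell values are kept as strings here; converting them to numbers or dates is a separate cleaning step that depends on the data.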

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.



A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.



49536 questions
4 votes, 1 answer

How to scrape and extract all the subcategories names from all its associated pages for a wikipedia category using python 3.6?

I want to scrape all the subcategories and pages under the category header of the Category page: "Category:Computer science". The link for the same is as follows: http://en.wikipedia.org/wiki/Category:Computer_science. I have got an idea regarding…
M S
4 votes, 2 answers

FileNotFoundError in 'wb' file mode in Python?

I am trying to write a program that downloads all the xkcd comic images and saves them in a directory, with each image named title.png, title being the title of the comic. Here's the code for it: #Downloads all the xkcd comics import…
Udasi Tharani
4 votes, 1 answer

How to extract the file modification time of a scraped image?

I'm trying to scrape part of a part-website that contain images of the parts, to collect some statistics. However, there is no url or image upload or creation date, so I have to use the approximate image file modification-date to get this info.…
not2qubit
4 votes, 2 answers

Scrapy: scraping data from Pagination

So far I have scraped data from one page. I want to continue until the end of the pagination. Click Here to view the page. There seems to be a problem because the href contains a JavaScript element.
Riwaj Chalise
4 votes, 1 answer

Web Scraping a tableauViz into an R dataframe

I have spent a lot of time searching for an answer to this, but have not found anything yet. What I am trying to accomplish is to scrape Tableau table information that is contained in a tableauViz element and propagate it into an R dataframe. In my…
UTexas80
4 votes, 1 answer

Web scraping using selenium and bs4

I'm trying to build a dataframe by scraping this page: https://www.schoolholidayseurope.eu/choose-a-country html. First of all, I told Selenium to click on the page of my choice, then I used XPath and tag elements to build the header and body, but…
4 votes, 5 answers

Python 3: How to web scrape text from div that contains multiple class values

I'm trying to web scrape a website (Here is the link to website), but the div in the page seems to have multiple class attributes, which is making it hard for me to scrape the data. I tried to look for historical questions posted on Stack Overflow, but could…
DanLee
4 votes, 2 answers

How to download file from a page using python

I am having trouble downloading the txt file from this page: https://www.ceps.cz/en/all-data#RegulationEnergy (when you scroll down and see Download: txt, xls and xml). My goal is to create a scraper that will go to the linked page, click on the txt…
Loko
4 votes, 0 answers

multiprocessing pool with a dictionary as one of the arguments?

Is it possible to use Pool.map() on a function that contains an empty dictionary as one of its arguments? I am new to multiprocessing and want to parallelize a web-scraping function. I tried following the example from this site however it doesn't…
Spencer Trinh
4 votes, 2 answers

Google scraping using python - requests: How to avoid being blocked due to many requests?

For a school project I need to get the web addresses of 200 companies (based on a list). My script is working fine, but when I'm around company 80, I get blocked by Google. This is the message that I'm getting. > Our systems have detected unusual…
PAstudilloE
4 votes, 1 answer

Scraping pagination with Python

I'm trying to scrape some data for airlines from the following website: http://www.airlinequality.com/airline-reviews/airasia-x[1]. I managed to get the data I need, but I am struggling with pagination on the web page. I'm trying to get all the…
onr
4 votes, 1 answer

Table element not showing in BeautifulSoup

I am trying to extract table data from this web site. Following is the code: import requests from bs4 import BeautifulSoup as bs page = requests.get('https://www.vitalityservicing.com/serviceapi/Monitoring/QueueDepth?tenantId=1') soup =…
spark
4 votes, 2 answers

Selenium, Presence of one of many elements located?

Building off of the answer to How to wait until the page is loaded with Selenium for Python? I am attempting to create a method that allows multiple elements to be polled for presence using Expected Conditions. I receive an error 'bool' object is…
Liquidgenius
4 votes, 3 answers

How do I check if a URL has a link on botw.org or not?

I am developing an application in which I have to check whether a link exists on botw.org for a given URL. Is there any free API available to check botw.org, or any other source to check this? thanks!
Tokendra Kumar Sahu
4 votes, 1 answer

Manually change response URL during Puppeteer request interception

I'm having a hard time navigating relative URLs with Puppeteer for a specific use case. Below you can see the basic setup and a pseudo example describing the problem. Essentially I want to change the current URL the browser thinks it is at. What I…
joe.hart