Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly in terms of scale, scheduling, and automation, are often referred to as spiders or web crawlers.
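
As a minimal sketch of the pattern-based extraction described above, the following Python snippet pulls prices out of a small piece of markup using only the standard library. The sample HTML, the class name "price", and the names PriceParser and extract_prices are assumptions made for illustration; a real scraper would first download the page (for example with urllib.request) and would need to match that site's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this markup first,
# for example with urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        # Only keep non-empty text found inside a price span
        if self.in_price and data.strip():
            self.prices.append(float(data.strip()))


def extract_prices(html):
    """Return all prices found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


prices = extract_prices(SAMPLE_HTML)
print(prices)
```

Third-party parsers are often used for this in practice; the standard-library parser is used here only to keep the sketch self-contained.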

Potential uses include:

  • Retrieving product or stock prices for comparison,

  • Contact scraping and collecting email addresses,

  • Site mashup or building an alternative front-end for an existing site,

  • Collection of real-estate pricing or auto sales statistics,

  • Website change detection,

  • Building archives of dead pages.

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal in itself, legal issues can arise if it is done with malicious or plagiaristic intent, to circumvent a site's purchasing system or subscription fees, or for other fraudulent purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a manner that is potentially contrary to the site's intended usage, it is important to exercise due diligence by educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider", and is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
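
A crawler can check such exclusion requests before fetching a page. The sketch below uses Python's standard urllib.robotparser module on an inline robots.txt; the rules, the "example-bot" user agent, the example.com URLs, and the helper name build_parser are all assumptions made for illustration. Against a live site, the parser would typically be pointed at the site's robots.txt with set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would load it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from robots.txt text instead of a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

# Ask whether the (hypothetical) bot may fetch each URL
allowed = rules.can_fetch("example-bot", "https://example.com/index.html")
blocked = rules.can_fetch("example-bot", "https://example.com/private/data.html")
print(allowed, blocked)
```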

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically HTML, into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
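
To illustrate that transformation, here is a small sketch that turns an HTML table into CSV text that a spreadsheet could open. The sample table, its values, and the names TableParser and table_to_csv are made up for this example; it assumes a simple table with no nested tables or merged cells.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>310000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect table cells into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current_row = None
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag in ("td", "th") and self.current_row is not None:
            self.in_cell = True
            self.current_row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.current_row is not None:
            self.rows.append(self.current_row)
            self.current_row = None

    def handle_data(self, data):
        # Append text to the cell currently being read
        if self.in_cell:
            self.current_row[-1] += data.strip()


def table_to_csv(html):
    """Convert the rows of a simple HTML table into CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


csv_text = table_to_csv(SAMPLE_TABLE)
print(csv_text)
```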

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.


A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.



49536 questions
4 votes, 3 answers

Scraping Booking coments with python

I am trying to get the titles of Booking.com comments from this website: https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75, where r_lang=all basically says that the website should show comments in every…
Vladimir Vargas
4 votes, 1 answer

How to grab quarterly and specific the date of yahoo financial data with python?

I can download the annual data from this link by the following code, but it's not the same as what's shown on the website because it's the data of June: Now I have two questions: How do I specific the date so the annual data is the same as the…
saga
4 votes, 1 answer

Starting Web Scraping with Python and BeautifulSoup - Errors during step by step tutorial

Followed this tutorial about Web Scraping with Python and BeautifulSoup to learn the ropes - However Pycharm returns an error which I do not understand Hi there! Tried the above mentioned tutorial with an adjusted link as the actual link…
JohnDoe
4 votes, 0 answers

R - Web scraping JavaScript objects with V8

I have some experience in R but completely new to JavaScript. I am recently trying to scrape a table from this website (http://op1.win007.com/Oddslist/1599893.htm). It seems to me that the webpage is written in JavaScript and therefore the simple…
Bosco Lam
4 votes, 0 answers

Reading HTML tag attribute names containing @ in R using xml2 package

I am trying to read a HTML document in R containing some vue.js script. This document contains tags with attributes containing @ symbol. When I read the document using read_html in R the attributes containing @ symbol are not parsed…
4 votes, 1 answer

Is there a way to make selenium work asynchronously?

My objective is to scrape as many profile links as possible on Khan Academy. And then scrape some specific data on each of these profiles to write them into a CSV file. My problem is simple: the script is way to slow. Here is the script: from…
RobZ
4 votes, 3 answers

Perl vs PHP to web scraping

Say we have project that requires web scraping. (parsing strings (< 40) and scraping web pages (geting meta datas and such) I am aware of that perl has great and suited cpan modules for this job, so i can take that way and don't bother myself that…
wonnie
4 votes, 0 answers

URL -web scraping with R

I am trying to scrape content from LinkedIn using R, but I keep on getting an error when trying to read the HTML content. This is my code :…
Eliza R
4 votes, 3 answers

Python Beautifulsoup (bs4) findAll not finding all elements

From the url that is in the code, I am ultimately trying to gather all of the players names from the page. However, when I am using .findAll in order to get all of the list elements, I am yet to be successful. Please advise. from urllib.request…
datam
4 votes, 0 answers

Web-scraping dynamic pages in Java

I know this question was asked before but none of the proposed solutions work in my case. I am trying to web-scrape a page of results but the problem is that 95% of div tags contain only class names that are dynamically changing. My code works for…
detrraxic
4 votes, 1 answer

retrieve all car links from dynamic page

from selenium import webdriver options = webdriver.ChromeOptions() options.add_argument("--user-agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109…
ziji zijia
4 votes, 3 answers

Using RSelenium: Java not found

I'm trying to execute code on R with the package RSelenium to do some webscraping, but I'm blocked at the very first step. After loading the library, I try to run this line of code: rmDr <- rsDriver(browser = "chrome", chromever = 'latest') But…
dgtrot
4 votes, 2 answers

Unable to let my script keep clicking on Load more button using IE

I've created a script in vba using IE to keep clicking on the Load more hits button located at the bottom of a webpage until there is no such button is left. Here is how my script can populate that button: In the site's landing page there is a…
MITHU
4 votes, 2 answers

Web scraping table gives correct reading from wrong data

I am trying to scrape this table from ESPN Neo York Knicks 2019,however from site the data is different from is actually being scraped So after making sure i am doing it correctly and searching other sites for actual dates it appears the data i am…
Amjasd Masdhash
4 votes, 0 answers

Chrome-Dev-Tool :- csm-hit cookie in Amazon

I'm trying to set cookies while scraping Amazon to not get caught and look like an authentic user. I'm trying to replicate the behaviour of the website. I've completely analyzed the headers, the request and response signatures etc. The only thing…
Praful Bagai