Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a full-fledged web browser. More advanced web scraping systems, particularly with regard to scale, scheduling, and automation, are often referred to as spiders or web crawlers.
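
As a minimal sketch of the pattern-based extraction described above, the following Python example pulls prices out of a small HTML snippet using only the standard library. The sample markup, the class name "price", and the names PriceParser and extract_prices are assumptions made for illustration; a real scraper would first download the page and would need to match that site's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical sample page. A real scraper would download this HTML
# (for example with urllib.request) instead of hard-coding it.
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
</body></html>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        # Only text found inside a price span is converted and stored.
        if self.in_price:
            self.prices.append(float(data.strip()))


def extract_prices(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


print(extract_prices(SAMPLE_HTML))  # [9.99, 24.5]
```

Libraries such as beautifulsoup or scrapy are commonly used for the same job; the standard-library parser is used here only to keep the sketch self-contained.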

Potential uses include:

Retrieving product or stock prices for comparison,
Contact scraping and collecting email addresses,
Site mashup or building an alternative front-end for an existing site,
Collection of real-estate pricing or auto sales statistics,
Website change detection,
Building archives of dead pages.
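
One of the uses listed above, website change detection, can be sketched by comparing content fingerprints between visits. This is only one possible approach, assuming that collapsing whitespace is enough normalization; the function names fingerprint and has_changed and the sample pages are hypothetical.

```python
import hashlib


def fingerprint(html):
    """Return a SHA-256 hex digest of the page, ignoring whitespace differences."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, current_html):
    """Compare a stored fingerprint with the fingerprint of freshly fetched HTML."""
    return fingerprint(current_html) != previous_fingerprint


# Hypothetical page contents from two visits to the same URL.
old_page = "<p>Price: 10</p>"
new_page = "<p>Price: 12</p>"

stored = fingerprint(old_page)
print(has_changed(stored, old_page))  # False
print(has_changed(stored, new_page))  # True
```

Pages that embed timestamps or rotating ads would report a change on every visit, so a practical detector would usually fingerprint only the extracted region of interest.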

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intentions, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malign purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider", and is a technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
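
To illustrate how a crawler might honor such exclusion requests, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt content, the user agent name "example-bot", and the example.com URLs are made up for the example; a real crawler would load the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch it from
# the site root (for example https://example.com/robots.txt).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_rules(robots_txt):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = make_rules(ROBOTS_TXT)

# Check each URL before requesting it.
print(rules.can_fetch("example-bot", "https://example.com/index.html"))         # True
print(rules.can_fetch("example-bot", "https://example.com/private/data.html"))  # False
```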

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically HTML, into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
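
The transformation from HTML into a structured form can be sketched as follows: an HTML table is parsed into rows and then written out as CSV, which a spreadsheet can open. The sample table and the names TableParser, table_to_rows and rows_to_csv are illustrative assumptions, and the sketch ignores nested tables and merged cells.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample data standing in for a downloaded page.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>198000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Turns <tr>/<th>/<td> markup into a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_rows(html):
    parser = TableParser()
    parser.feed(html)
    return parser.rows


def rows_to_csv(rows):
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = table_to_rows(SAMPLE_TABLE)
print(rows_to_csv(rows))
```

The same list of rows could just as well be inserted into a database table; CSV is used here because it needs nothing beyond the standard library.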

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.

web-scraping is most often tagged along with:

➡ python (including beautifulsoup, scrapy and selenium)
➡ javascript (including node.js and phantomjs)
➡ r (including rvest)
➡ selenium
➡ xml (including xpath)
➡ java (including jsoup)
➡ php
➡ vba (including vba-excel)

A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.

Further Reading:

Wikipedia: Web scraping
Overview of the types of web scraping, as well as techniques, software, legalities and prevention.
GitHub: Guide to Preventing Web Scraping
Detailed advice on prevention of web scraping. (Original article on Stack Overflow)
HartleyBrody: I Don't Need No Stinking API
A commercial blogger's views, advice and tips for scraping.
Stack Overflow: Scraping data from website using VBA
Discussion and examples of getting started scraping with VBA.
SitePoint: Web Scraping for Beginners
Theory and examples for beginners to web scraping.

49536 questions

3 answers

Scraping Booking coments with python

I am trying to get the titles of Booking.com comments from this website: https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75, where r_lang=all basically says that the website should show comments in every…

web-scraping beautifulsoup urllib

asked Apr 13 '19 at 19:23

Vladimir Vargas

1 answer

How to grab quarterly and specific the date of yahoo financial data with python?

I can download the annual data from this link by the following code, but it's not the same as what's shown on the website because it's the data of June: Now I have two questions: How do I specify the date so the annual data is the same as the…

python web-scraping yahoo-finance

asked Apr 09 '19 at 01:56

saga

1 answer

Starting Web Scraping with Python and BeautifulSoup - Errors during step by step tutorial

Followed this tutorial about Web Scraping with Python and BeautifulSoup to learn the ropes - However Pycharm returns an error which I do not understand Hi there! Tried the above mentioned tutorial with an adjusted link as the actual link…

python web-scraping beautifulsoup

asked Apr 06 '19 at 19:59

JohnDoe

0 answers

R - Web scraping JavaScript objects with V8

I have some experience in R but completely new to JavaScript. I am recently trying to scrape a table from this website (http://op1.win007.com/Oddslist/1599893.htm). It seems to me that the webpage is written in JavaScript and therefore the simple…

javascript r web-scraping v8 rvest

asked Apr 01 '19 at 13:50

Bosco Lam

0 answers

Reading HTML tag attribute names containing @ in R using xml2 package

I am trying to read an HTML document in R containing some vue.js script. This document contains tags with attributes containing the @ symbol. When I read the document using read_html in R the attributes containing the @ symbol are not parsed…

html r web-scraping rvest xml2

asked Mar 31 '19 at 19:07

Rajesh Talluri

1 answer

Is there a way to make selenium work asynchronously?

My objective is to scrape as many profile links as possible on Khan Academy, and then scrape some specific data on each of these profiles to write them into a CSV file. My problem is simple: the script is way too slow. Here is the script: from…

python-3.x selenium asynchronous web-scraping thread-safety

asked Mar 30 '19 at 15:43

RobZ

3 answers

Perl vs PHP to web scraping

Say we have a project that requires web scraping (parsing strings (< 40) and scraping web pages (getting metadata and such)). I am aware that Perl has great and well-suited CPAN modules for this job, so I can take that way and don't bother myself that…

php python perl performance web-scraping

asked Apr 04 '11 at 12:24

wonnie

0 answers

URL -web scraping with R

I am trying to scrape content from LinkedIn using R, but I keep on getting an error when trying to read the HTML content. This is my code :…

html r web-scraping

asked Mar 26 '19 at 09:25

Eliza R

3 answers

Python Beautifulsoup (bs4) findAll not finding all elements

From the URL that is in the code, I am ultimately trying to gather all of the players' names from the page. However, when I use .findAll in order to get all of the list elements, I have yet to be successful. Please advise. from urllib.request…

python web-scraping beautifulsoup

asked Mar 21 '19 at 05:49

datam

0 answers

Web-scraping dynamic pages in Java

I know this question was asked before but none of the proposed solutions work in my case. I am trying to web-scrape a page of results but the problem is that 95% of div tags contain only class names that are dynamically changing. My code works for…

selenium web-scraping

asked Mar 19 '19 at 11:32

detrraxic

1 answer

retrieve all car links from dynamic page

from selenium import webdriver options = webdriver.ChromeOptions() options.add_argument("--user-agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109…

python web-scraping

asked Mar 14 '19 at 06:34

ziji zijia

3 answers

Using RSelenium: Java not found

I'm trying to execute code on R with the package RSelenium to do some webscraping, but I'm blocked at the very first step. After loading the library, I try to run this line of code: rmDr <- rsDriver(browser = "chrome", chromever = 'latest') But…

r web-scraping rselenium

asked Mar 11 '19 at 12:49

dgtrot

2 answers

Unable to let my script keep clicking on Load more button using IE

I've created a script in VBA using IE to keep clicking on the Load more hits button located at the bottom of a webpage until no such button is left. Here is how my script can populate that button: In the site's landing page there is a…

vba web-scraping internet-explorer-11

asked Mar 08 '19 at 19:39

MITHU

2 answers

Web scraping table gives correct reading from wrong data

I am trying to scrape this table from ESPN New York Knicks 2019, however the data on the site is different from what is actually being scraped. So after making sure I am doing it correctly and searching other sites for actual dates it appears the data I am…

python web-scraping beautifulsoup

asked Mar 03 '19 at 17:15

Amjasd Masdhash

0 answers

Chrome-Dev-Tool :- csm-hit cookie in Amazon

I'm trying to set cookies while scraping Amazon to not get caught and look like an authentic user. I'm trying to replicate the behaviour of the website. I've completely analyzed the headers, the request and response signatures etc. The only thing…

web-scraping cookies google-chrome-devtools

asked Mar 02 '19 at 19:53

Praful Bagai
