Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, particularly with regard to scale, scheduling, and automation, are often referred to as spiders or web crawlers.
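As a rough illustration of extracting data by relying on patterns in the markup, here is a minimal sketch that collects the `href` of every `<a>` tag using only Python's standard library. The sample HTML and the names `LinkExtractor` and `extract_links` are assumptions made up for this example; in practice the HTML would come from an HTTP request, and many people use a library such as BeautifulSoup instead.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string (hypothetical helper)."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page; a real scraper would download this instead.
SAMPLE_HTML = """
<html><body>
  <a href="/news/brexit-update">Brexit update</a>
  <a href="/sport/results">Results</a>
  <a name="anchor-without-href">No link</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

The parser only reacts to the one pattern it cares about (an `a` start tag with an `href` attribute), which is the essence of the approach described above.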

Potential uses include:

Retrieving product or stock prices for comparison,
Contact scraping and collecting email addresses,
Site mashup or building an alternative front-end for an existing site,
Collection of real-estate pricing or auto sales statistics,
Website change detection,
Building archives of dead pages.

The practice of web scraping has drawn considerable controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal in itself, legal issues can arise if it is done with malicious or plagiaristic intent, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malicious purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider" and is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
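A minimal sketch of how a crawler might check exclusion rules before fetching a page, using `urllib.robotparser` from Python's standard library. The robots.txt content, the user agent name `example-bot`, and the `example.com` URLs are all assumptions for illustration; a real crawler would load the file from the target site (for instance with `set_url()` and `read()`).

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; normally fetched from <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "https://example.com/public/page.html",
        "https://example.com/private/data.html",
    ):
        print(url, is_allowed(rules, "example-bot", url))
```

Checking each URL against the rules before requesting it keeps the politeness logic separate from the fetching and parsing code.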

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
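To make the idea of turning HTML into a structured form concrete, the sketch below converts a small HTML table into CSV text that a spreadsheet or database import could consume. It uses only the standard library; the sample table and the names `TableParser` and `table_to_csv` are assumptions for this example, and it deliberately ignores complications such as nested tables or `colspan`.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table rows found in html as CSV text (hypothetical helper)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Made-up sample data standing in for a scraped page.
SAMPLE_TABLE = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE_TABLE), end="")
```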

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.

web-scraping is most often tagged along with:

➡ python (including beautifulsoup, scrapy and selenium)
➡ javascript (including node.js and phantomjs)
➡ r (including rvest)
➡ selenium
➡ xml (including xpath)
➡ java (including jsoup)
➡ php
➡ vba (including vba-excel)

A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.

Further Reading:

Wikipedia: Web scraping
Overview of the types of web scraping, as well as techniques, software, legalities and prevention.
GitHub: Guide to Preventing Web Scraping
Detailed advice on prevention of web scraping. (Original article on Stack Overflow)
HartleyBrody: I Don't Need No Stinking API
A commercial blogger's views, advice and tips for scraping.
Stack Overflow: Scraping data from website using VBA
Discussion and examples of getting started scraping with VBA.
SitePoint: Web Scraping for Beginners
Theory and examples for beginners to web scraping.

49536 questions

votes

3 answers

Using BeautifulSoup to find links related to specific keyword

I have to modify this code so the scraping keeps only the links that contain a specific keyword. In my case I'm scraping a newspaper page to find news related to the term 'Brexit'. I've tried modifying the method parse_links so it only keeps the…

python web-scraping beautifulsoup web-crawler

asked Feb 28 '19 at 13:13

CarlosT

votes

2 answers

Why am I not getting any data back from website?

So I'm brand new the whole web scraping thing. I've been working on a project that requires me to get the word of the day from here. I have successfully grabbed the word now I just need to get the definition, but when I do so I get this…

python html xpath web-scraping lxml

asked Feb 26 '19 at 20:29

jaden

votes

3 answers

Selenium is really slow for me, is there something wrong with my code?

im new to webscraping and python. I have done a script before that worked just fine. Im doing basically the same thing in this one but it runs way slower. This is my code: import requests from bs4 import BeautifulSoup from selenium import…

python selenium web-scraping

asked Feb 26 '19 at 19:39

user11121374

votes

1 answer

How to scrape Facebook data using Graph API and the User Token?

I am trying scrape Facebook data, of public pages. The code I was using a couple of months (10 months ago maybe) ago was working fine. Now, when I wanted to continue that project, but the code is not working anymore. I used to use my private user…

python facebook-graph-api web-scraping social-media

asked Feb 22 '19 at 11:46

ZelelB

votes

2 answers

How to extract the price for the security as text from the website through Python Selenium BeautifulSoup

I am trying to simply get the price for the security shown at https://investor.vanguard.com/529-plan/profile/4514 . I run this code: from selenium import webdriver from bs4 import BeautifulSoup driver =…

python selenium web-scraping beautifulsoup webdriverwait

asked Feb 16 '19 at 05:55

Ellie The Good Dog

votes

3 answers

Extract Text Data from a Div Tag but not a from a Child H3 Tag

I have an HTML snippet that I need to get data from using BeautifuSoup:

python-3.x web-scraping beautifulsoup

asked Feb 15 '19 at 10:25

ArthurEzenwanne

votes

7 answers

Can't get rid of "keep/discard" notification while downloading ".eml" files

How can I get rid of this keep/discard notification while downloading files via python selenium chromedriver? I've tried with the following but could not succeed: chromeOptions = webdriver.ChromeOptions() prefs =…

python python-3.x selenium web-scraping selenium-chromedriver

asked Feb 02 '19 at 06:18

robots.txt

votes

4 answers

Getting all images from a webpage and save the to disk programmatically (NodeJS & Javascript)

I need to get a lot of images from a few websites and download them to my disk so that I can use them (will upload them to a blob (azure) and then save the link to my DB). GETTING THE IMAGES I know how to get the images from the html with JS, for…

javascript node.js web-scraping

asked Jan 30 '19 at 14:31

Jack

votes

1 answer

Google Maps: how can I get the exact date of a Google review for a business I don't own?

I'd like to explore patterns of Google reviews for a specific business (that I do not own). It would be useful to get the exact date of a review, rather than just the "3 months ago" or "1 year ago" approximation that you get via the web…

google-maps web-scraping google-api

asked Jan 06 '19 at 07:49

rewbs

votes

1 answer

Python 3.6 - image scraping with google-image-download

I want to crawl some images for my machine learning practice and found this google-image-download to very useful and the codes works out of the box. However, at the moment, it only allow not more than 100 images, which is the limit from google image…

python selenium web-scraping selenium-chromedriver

asked Jan 02 '19 at 14:29

sooon

votes

1 answer

How to handle Error "'NoneType' object has no attribute 'keys'", when converting list to DataFrame

Trying to create a dataframe from a list but get error "'NoneType' object has no attribute 'keys'" import numpy as np import pandas as pd import requests import json from sklearn import preprocessing from sklearn.preprocessing import…

python python-3.x dataframe web-scraping

asked Jan 02 '19 at 13:49

MisterButter

votes

1 answer

Response [400] when use file for parsing in python

It is OK (response [200]) when I try to parse with manual texting but when I change the input from a file it becomes response [400]. This the code import requests from bs4 import BeautifulSoup def people_spider(): file =…

python web-scraping

asked Dec 26 '18 at 05:04

Samudra Ajri Kifli

votes

3 answers

How to extract data from HTML using beuatiful soup

I am trying to scrape a web page and store the results in a csv/excel file. I am using beautiful soup for this. I am trying to extract the data from a soup , using the find_all function, but I am not sure how to capture the data in the field name or…

python html web-scraping beautifulsoup

asked Dec 25 '18 at 18:31

Keshav c

votes

4 answers

print text inside parent div beautifulsoup

i'm trying to fetch each product's name and price from https://www.daraz.pk/catalog/?q=risk but nothing shows up. containers = page_soup.find_all("div",{"class":"c2p6A5"}) for container in containers: pname = container.findAll("div", {"class":…

python web-scraping beautifulsoup

asked Dec 15 '18 at 11:59

Subial Ijaz

votes

2 answers

Python webscraping: BeautifulSoup not showing all html source content

I am quite new to webscraping and python. I was trying make a script that gets the Last Trade Price from http://finra-markets.morningstar.com/BondCenter/BondDetail.jsp?symbol=NFLX4333665&ticker=C647273 but some content seems to be missing when i…

javascript python selenium-webdriver iframe web-scraping

asked Dec 13 '18 at 01:29

predu

