Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced web scraping systems, notably in terms of scale, scheduling, and automation, are often referred to as spiders or web crawlers.
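The "patterns in the HTML markup" idea can be sketched with Python's standard library alone. This is a minimal illustration, not a recommended production approach: the sample markup, the `name`/`price` class names, and the `SpanTextParser` and `scrape_products` helpers are all hypothetical, and the page is an inline string rather than a real HTTP response.

```python
from html.parser import HTMLParser

# Hypothetical page content. In practice this string would come from an HTTP
# response, e.g. urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
</body></html>
"""


class SpanTextParser(HTMLParser):
    """Collect the text of every <span> whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class of the span currently being read, if any
        self.found = []      # (class, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in self.wanted:
            self.current = cls

    def handle_data(self, data):
        # Only keep non-blank text that sits inside a span we care about.
        if self.current is not None and data.strip():
            self.found.append((self.current, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


def scrape_products(html):
    """Return (name, price) tuples, relying on the repeating span pattern."""
    parser = SpanTextParser({"name", "price"})
    parser.feed(html)
    names = [text for cls, text in parser.found if cls == "name"]
    prices = [float(text) for cls, text in parser.found if cls == "price"]
    return list(zip(names, prices))


products = scrape_products(SAMPLE_HTML)
print(products)
```

Third-party parsers such as BeautifulSoup express the same pattern matching more compactly. Pages that build their content with JavaScript generally need the embedded-browser approach instead.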

Potential uses include:

  • Retrieving product or stock prices for comparison,

  • Contact scraping and collecting email addresses,

  • Site mashup or building an alternative front-end for an existing site,

  • Collection of real-estate pricing or auto sales statistics,

  • Website change detection

  • Building archives of dead pages

The practice of web scraping has drawn a lot of controversy because the terms of use or copyrights of some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intentions, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malign purposes.

There have been numerous lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: a bot or "spider" indexes information on the web. It is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
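As a rough sketch of how a crawler might honor such exclusion requests, Python's standard `urllib.robotparser` module can evaluate robots.txt rules. The rules, the `example-bot` user agent, the example.com URLs, and the `build_policy` helper are assumptions made for illustration. A real crawler would download the file from the target site rather than use an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, normally served at https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_policy(robots_text):
    """Parse robots.txt text into a RobotFileParser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)

# "example-bot" is an illustrative user-agent name, not a real crawler.
allowed = policy.can_fetch("example-bot", "https://example.com/articles/1")
blocked = policy.can_fetch("example-bot", "https://example.com/private/data")
print(allowed, blocked)
```

Checking `can_fetch` before every request keeps the crawl within what the site operator has published. It does not replace reading the site's terms of service.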

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
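The transformation step might look like the following sketch, which turns an HTML table into a list of records and then into CSV text that a spreadsheet or database import could consume. The table contents and the `TableParser`, `table_to_records`, and `records_to_csv` helpers are hypothetical, and only the standard library is used.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical scraped markup: one header row followed by data rows.
SAMPLE_TABLE = """
<table>
  <tr><th>city</th><th>median_price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>198500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self.row = None       # row currently being built, if any
        self.in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.in_cell = True
            self.row.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None


def table_to_records(html):
    """Treat the first row as the header and return one dict per data row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def records_to_csv(records):
    """Serialize the records to CSV text, using the first record's keys as columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = table_to_records(SAMPLE_TABLE)
csv_text = records_to_csv(records)
print(csv_text)
```

Once the data is in record form, the same list of dicts could just as easily be inserted into a database or loaded into a DataFrame.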

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data, as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.


A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.



49536 questions
4
votes
3 answers

Using BeautifulSoup to find links related to specific keyword

I have to modify this code so the scraping keeps only the links that contain a specific keyword. In my case I'm scraping a newspaper page to find news related to the term 'Brexit'. I've tried modifying the method parse_links so it only keeps the…
CarlosT
  • 181
  • 1
  • 18
4
votes
2 answers

Why am I not getting any data back from website?

So I'm brand new to the whole web scraping thing. I've been working on a project that requires me to get the word of the day from here. I have successfully grabbed the word; now I just need to get the definition, but when I do so I get this…
jaden
  • 43
  • 8
4
votes
3 answers

Selenium is really slow for me, is there something wrong with my code?

I'm new to web scraping and Python. I have done a script before that worked just fine. I'm doing basically the same thing in this one but it runs way slower. This is my code: import requests from bs4 import BeautifulSoup from selenium import…
user11121374
  • 43
  • 1
  • 3
4
votes
1 answer

How to scrape Facebook data using Graph API and the User Token?

I am trying to scrape Facebook data from public pages. The code I was using a couple of months ago (maybe 10 months ago) was working fine. Now that I want to continue that project, the code is not working anymore. I used to use my private user…
ZelelB
  • 1,836
  • 7
  • 45
  • 71
4
votes
2 answers

How to extract the price for the security as text from the website through Python Selenium BeautifulSoup

I am trying to simply get the price for the security shown at https://investor.vanguard.com/529-plan/profile/4514 . I run this code: from selenium import webdriver from bs4 import BeautifulSoup driver =…
4
votes
3 answers

Extract Text Data from a Div Tag but not from a Child H3 Tag

I have an HTML snippet that I need to get data from using BeautifulSoup:
ArthurEzenwanne
  • 176
  • 2
  • 10
4
votes
7 answers

Can't get rid of "keep/discard" notification while downloading ".eml" files

How can I get rid of this keep/discard notification while downloading files via python selenium chromedriver? I've tried with the following but could not succeed: chromeOptions = webdriver.ChromeOptions() prefs =…
4
votes
4 answers

Getting all images from a webpage and saving them to disk programmatically (NodeJS & Javascript)

I need to get a lot of images from a few websites and download them to my disk so that I can use them (will upload them to a blob (azure) and then save the link to my DB). GETTING THE IMAGES I know how to get the images from the html with JS, for…
Jack
  • 491
  • 7
  • 27
4
votes
1 answer

Google Maps: how can I get the exact date of a Google review for a business I don't own?

I'd like to explore patterns of Google reviews for a specific business (that I do not own). It would be useful to get the exact date of a review, rather than just the "3 months ago" or "1 year ago" approximation that you get via the web…
rewbs
  • 1,958
  • 4
  • 22
  • 34
4
votes
1 answer

Python 3.6 - image scraping with google-image-download

I want to crawl some images for my machine learning practice and found google-image-download to be very useful; the code works out of the box. However, at the moment, it allows no more than 100 images, which is the limit from google image…
sooon
  • 4,718
  • 8
  • 63
  • 116
4
votes
1 answer

How to handle Error "'NoneType' object has no attribute 'keys'", when converting list to DataFrame

Trying to create a dataframe from a list but I get the error "'NoneType' object has no attribute 'keys'": import numpy as np import pandas as pd import requests import json from sklearn import preprocessing from sklearn.preprocessing import…
MisterButter
  • 749
  • 1
  • 10
  • 27
4
votes
1 answer

Response [400] when using a file for parsing in Python

It is OK (response [200]) when I enter the text manually, but when I read the input from a file it becomes response [400]. This is the code: import requests from bs4 import BeautifulSoup def people_spider(): file =…
4
votes
3 answers

How to extract data from HTML using Beautiful Soup

I am trying to scrape a web page and store the results in a csv/excel file. I am using Beautiful Soup for this. I am trying to extract the data from a soup, using the find_all function, but I am not sure how to capture the data in the field name or…
Keshav c
  • 43
  • 4
4
votes
4 answers

print text inside parent div beautifulsoup

I'm trying to fetch each product's name and price from https://www.daraz.pk/catalog/?q=risk but nothing shows up. containers = page_soup.find_all("div",{"class":"c2p6A5"}) for container in containers: pname = container.findAll("div", {"class":…
4
votes
2 answers

Python webscraping: BeautifulSoup not showing all html source content

I am quite new to web scraping and Python. I was trying to make a script that gets the Last Trade Price from http://finra-markets.morningstar.com/BondCenter/BondDetail.jsp?symbol=NFLX4333665&ticker=C647273 but some content seems to be missing when I…
predu
  • 43
  • 1
  • 4