Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

Web scraping (also known as web harvesting, web mining or web data extraction) is the act of using programming to extract information from the web.

Web scraping works by requesting HTML pages from a website and extracting specific data by taking advantage of patterns in the HTML markup, or by embedding a fully-fledged web browser. More advanced systems of web scraping, namely with regards to magnitude, scheduling, and automation, are often referred to as spiders, or web crawlers.
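As a rough illustration of extracting data by relying on patterns in the markup, here is a minimal sketch that uses only Python's standard library. The sample HTML, the `price` class name, and the helper names `PriceParser` and `extract_prices` are assumptions made up for this example; a real scraper would first fetch the page over HTTP and would have to match the target site's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
</body></html>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element as a float."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        # Only keep non-empty text found inside a price span.
        if self._in_price and data.strip():
            self.prices.append(float(data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


def extract_prices(html):
    """Return all prices found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


print(extract_prices(SAMPLE_HTML))
```

Third-party parsers such as Beautiful Soup are commonly used for the same job; the standard-library parser is used here only to keep the sketch self-contained.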

Potential uses include:

  • Retrieving product or stock prices for comparison,

  • Contact scraping and collecting email addresses,

  • Site mashup or building an alternative front-end for an existing site,

  • Collection of real-estate pricing or auto sales statistics,

  • Website change detection

  • Building archives of dead pages

The practice of web scraping has drawn a lot of controversy because the terms of use or copyrights for some websites and electronic publications do not allow certain kinds of data mining. While web scraping is not illegal on its own, legal issues can arise if it is done with malicious or plagiaristic intentions, to circumvent a site's purchasing system or subscription fees, or for other fraudulent or malicious purposes.

There have been numerous cases of lawsuits and other legal actions against companies and individuals. Before attempting to extract any information from a website in a way that is potentially contrary to the site's intended usage, it is important to exercise due diligence in educating yourself about applicable local and international laws as well as the site's terms of service, copyrights, and trademarks. Further discussion of the legal implications can be found online, including on Wikipedia, Hacker News and Laws.com.

Web crawling is a component of web scraping across multiple sites: it indexes information on the web using a bot or "spider". It is a universal technique adopted by most search engines, which honor exclusion requests such as those published in a robots.txt file placed on the site.
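To make the idea of honoring exclusion requests concrete, the following sketch checks URLs against robots.txt rules with Python's standard `urllib.robotparser` module. The rules, the `example-bot` user agent, and the `example.com` URLs are invented for illustration; a real crawler would download the site's own robots.txt before requesting any page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# A well-behaved crawler asks before every request.
print(rules.can_fetch("example-bot", "https://example.com/private/data.html"))
print(rules.can_fetch("example-bot", "https://example.com/public/index.html"))
```

With these assumed rules, the first check is refused and the second is allowed.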

In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically from HTML into a structured form that can be more easily stored, manipulated, and analyzed using tools such as a database or spreadsheet.
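The transformation from HTML into a structured form might look roughly like the sketch below, which turns an HTML table into rows and then into CSV text that a spreadsheet or database import could consume. The sample table and the names `TableParser`, `table_to_rows`, and `rows_to_csv` are assumptions for this example, and the parser handles only simple, non-nested tables.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample table; the figures are made up for illustration.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Median price</th></tr>
  <tr><td>Springfield</td><td>250000</td></tr>
  <tr><td>Shelbyville</td><td>310000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the cell text of a simple, non-nested HTML table row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Append text to the cell currently being read.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = table_to_rows(SAMPLE_TABLE)
print(rows_to_csv(rows))
```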

Screen scraping has a similar purpose but involves the programmatic collection of visual data from a source (as opposed to parsing data as in web scraping). It originally involved reading a terminal's memory or video data by connecting the terminal to another computer's input port.


A note on spelling

The verb is spelled to scrape, or as the present participle scraping, and is not to be confused with to scrap or scrapping, which is to discard something you no longer want or need, or to not continue with a plan.


49536 questions
4
votes
1 answer

Puppeteer - Async function in evaluate method throws error

I am trying to check if og:image source exists. If I want to call async method in evaluate function, I get Error: Evaluation failed: [object Object] error. Error: Evaluation failed: [object Object] at ExecutionContext._evaluateInternal…
Matt
  • 8,195
  • 31
  • 115
  • 225
4
votes
2 answers

How to get href from tag which contains JavaScript using Python?

I am trying to get href from a tag using Python + Selenium, but the href is having "JavaScript" in it. So I am unable to get the target URL. I am using Python 3.7.3, selenium 3.141.0. HTML:
m.gibin
  • 117
  • 1
  • 8
4
votes
2 answers

Single Scrapy Project vs. Multiple Projects

I have this dilemma on how to store all of my spiders. These spiders will be fed into Apache NiFi using a command line invocation and items read from stdin. I also plan to have a subset of these spiders return single item results using…
Lijo
  • 43
  • 3
4
votes
1 answer

"PATH to JAVA not found. Please check JAVA is installed." error when initialising RSelenium

I am trying to start an RSelenium session to webscrape. However, when running this code: driver <- rsDriver(browser=c("chrome"), chromever="76.0.3809.126", port = 4444L) I get this error: Error in java_check() : PATH to JAVA not found. Please…
natedjurus
  • 319
  • 3
  • 11
4
votes
4 answers

getting a substring from each element of a list

I'm trying to create a list of filter facets. I've loaded all the in to a list with bs4 and now need to grab a specific substring out of the larger string that is the . I want to load each filter facet name in to a list to end up with a…
LvP
  • 55
  • 4
4
votes
1 answer

Click on HTMLelement if condition is satisfied

I'm wondering how I can click on an HTML element through VBA if another condition is satisfied. To make it clear, I will show you a short example: I need to analyze data in a specific quarter (let's say I need Q2) and for each quarter…
Zakiirim
  • 81
  • 1
  • 9
4
votes
1 answer

Unable to let my script run through the end

I've written a script in vba using ServerXMLHTTP requests in order to be able to use proxy along with setting timeout parameter within it. When I run the script, it appears to be working but the problem is - it gets stuck after using the first…
robots.txt
  • 96
  • 2
  • 10
  • 36
4
votes
1 answer

How can I bypass a cookie agreement page while web scraping using Python?

I hurt my nose to a cookie agreement page... What I am doing: import requests url = "https://stockhouse.com/community/bullboards/" r = requests.get(url) soup = BeautifulSoup(r.content, "html.parser") print(soup) which returns HTML from a cookie…
Vincent Labrecque
  • 304
  • 1
  • 5
  • 12
4
votes
2 answers

How to scrape <li> tag with class like active/selected?

I'm trying to scrape a list from a website. There are two different lists, and one will load only after the first option is chosen. Issue is, I'm unable to select the first option. I scraped the list of all available options. But after writing it, I…
4
votes
2 answers

Can't scrape the links of different companies from a website using requests

I'm trying to get the links of different companies from a webpage but the script I've tried with throws the error below. In chrome dev tools I could see that I can get the ids of different companies using post http requests. However, if I can get…
MITHU
  • 113
  • 3
  • 12
  • 41
4
votes
1 answer

POST request with httr package using R

I would like to get the output from POST request using httr from following site: http://www.e-grunt.ba You can see submit form when you click "ZK Ulošci". There I would like to send POST request and get the output. For example, you can select…
Mislav
  • 1,533
  • 16
  • 37
4
votes
2 answers

Web scraping with python how to get to the text

I'm trying to get the text from a website but can't find a way to do it. How do I need to write it? link="https://www.ynet.co.il/articles/0,7340,L-5553905,00.html" response = requests.get(link) soup = BeautifulSoup(response.text,'html.parser') info…
Michael
  • 189
  • 1
  • 10
4
votes
2 answers

Download CSV file from results page with options from dropdown menu

I am a novice at web scraping with R and I am stuck on this problem: I want to use R to submit a search query to PubMed, then download a CSV file from the results page. The CSV file can be accessed by clicking 'Send to', which opens a dropdown menu,…
kstew
  • 1,104
  • 6
  • 21
4
votes
1 answer

How to close newly constructed tab using selenium, chrome driver and python

I am trying to scrape data from a website, there is an url which lands me a particular page, there we have links of some items, if I click on those links, it opens in a new tab, and I can extract data from there, But after extracting the data, I…
4
votes
2 answers

How can I obtain amino acid sequence from this URL?

I want to obtain amino acid sequence from below url by using python and Selenium, but couldn't succeed. http://flybase.org/download/sequence/FBgn0003719/FBpp I've tried Beautiful Soup and Selenium. from selenium import webdriver driver =…
hinatafly
  • 41
  • 1