
Not really sure the complexity of this question, but figured I'd give it a shot.

How can I create a web crawler/scraper (not sure which I'd need) to get a csv of all CEO pay-ratio data. https://www.bloomberg.com/graphics/ceo-pay-ratio/

I'd like this information for further analysis, however, I am not sure how to retrieve it for a dynamic webpage. I have built web scrapers in the past, but for simple websites and functions.

If you could point me to a good resource or post the code below I will forever be in your debt.

Thanks in advance!

Sean. D 1528

2 Answers


Since the website seems to load its content dynamically, I believe you will need Selenium, a library that automates browsers, and BeautifulSoup, a library for parsing the resulting pages.

Since the part of the website you are interested in is a single page and you only need to retrieve the data, I would suggest first investigating how the data are loaded into the page. It is plausible that you could make a request directly to their server with the same parameters as the page's script, and retrieve the data you are interested in without a browser at all.

To make such a request you could consider using yet another library called requests.

Luca Cappelletti

Note that scraping this website may be flagged as a violation of its terms of service; this particular website uses multiple techniques to prevent script-based scraping.


If you inspect the webpage, you may observe that clicking the next button triggers no XHR request, so you can deduce that the content is loaded only once.

If you sort the network requests by size, you will find that all the data are loaded from a single JSON file.


Using Python (note that you need to open the page in a browser just before running the script):

import requests

# Fetch the JSON file that the page itself loads its data from
data = requests.get("https://www.bloomberg.com/graphics/ceo-pay-ratio/live-data/ceo-pay-ratio/live/data.json").json()

for each in data['companies']:
    try:
        print("Company", each['c'], "=> CEO pay ratio", each['cpr'])
    except KeyError:
        # Some entries have no 'cpr' key
        print("Company", each['c'], "=> no CEO pay ratio !")

Which gives you:

Company Aflac Inc => CEO pay ratio 300
Company American Campus Communities Inc => CEO pay ratio 226
Company Aetna Inc => CEO pay ratio 235
Company Ameren Corp => CEO pay ratio 66
Company AmerisourceBergen Corp => CEO pay ratio 0
Company Advance Auto Parts Inc => CEO pay ratio 329
Company American International Group Inc => CEO pay ratio 697
Company Arthur J Gallagher & Co => CEO pay ratio 126
Company Arch Capital Group Ltd => CEO pay ratio 104
Company ACADIA Pharmaceuticals Inc => CEO pay ratio 54
[...]

It may be easier to open the JSON in a web browser and save it locally than to request it from the website with a script.

After saving the JSON locally as data.json, you can read it with:

import json

# Read the locally saved copy of the JSON file
with open("data.json", "r") as f:
    data = json.load(f)

for each in data['companies']:
    try:
        print("Company", each['c'], "=> CEO pay ratio", each['cpr'])
    except KeyError:
        # Some entries have no 'cpr' key
        print("Company", each['c'], "=> no CEO pay ratio !")
A. STEFANI
  • After entering this code I got a traceback error... is there any way you could rewrite the code for a locally saved .json or .csv? Really appreciate it :) – Sean. D 1528 Jul 03 '18 at 08:14
  • Traceback (most recent call last): File ".\CEOpayratio.py", line 2, in data=requests.get("https://www.bloomberg.com/graphics/ceo-pay-ratio/live-data/ceo-pay-ratio/live/data.json").json() File "C:\Users\Seane\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\models.py", line 896, in json return complexjson.loads(self.text, **kwargs) File "C:\Users\Seane\AppData\Local\Programs\Python\Python36\lib\site-packages\simplejson\__init__.py", line 518, in loads return _default_decoder.decode(s) – Sean. D 1528 Jul 03 '18 at 08:19