
I'm trying to get the links of different companies from a webpage, but the script I've tried throws the error below. In Chrome dev tools I can see that the ids of the different companies can be fetched with a POST http request. If I can get those ids, I should be able to plug them into 'https://angel.co/startups/{}' to build the full company links.

Webpage link

I've tried with:

import requests

link = 'https://angel.co/company_filters/search_data'
base = 'https://angel.co/startups/{}'

payload={'sort':'signal','page':'2'}

r = requests.post(link, data=payload, headers={
    'x-requested-with': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0'
    })
print(r.json())

The above script throws the following error:

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
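
That traceback comes from `r.json()` calling `json.loads()` on a body that isn't JSON at all — typically an empty response or an HTML error/block page. A minimal reproduction of the same failure:

```python
import json

# r.json() is essentially json.loads(r.text); feeding it an HTML body
# (e.g. a CDN block page) raises exactly this error.
try:
    json.loads("<html>Access denied</html>")
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)
```

So before calling `r.json()`, it's worth printing `r.status_code` and `r.text[:200]` to see what actually came back.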

How can I get the links of different companies from the aforementioned site using requests?

MITHU

2 Answers


I've made a function get_soup(page), which accepts a page number starting from 1 and returns a soup containing the relevant data. You can call this function in a loop to scrape more pages:

import requests
from bs4 import BeautifulSoup

def get_soup(page=1):
    headers = {
        'Accept-Language'           : 'en-US,en;q=0.5',
        'Host'                      : 'angel.co',
        'User-Agent'                : 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'
    }

    payload = {'sort': 'signal', 'page': str(page)}

    url = 'https://angel.co/company_filters/search_data'

    # First request: returns JSON with company ids, paging info and a hexdigest token
    data = requests.get(url, headers=headers, data=payload).json()

    # Build the second URL from the fields of the first response
    new_url = 'https://angel.co/companies/startups?' + '&'.join('ids[]={}'.format(_id) for _id in data['ids'])
    new_url += '&sort=' + data['sort']
    new_url += '&total=' + str(data['total'])
    new_url += '&page=' + str(data['page'])
    new_url += '&new=' + str(data['new']).lower()
    new_url += '&hexdigest=' + data['hexdigest']

    # Second request: the JSON's 'html' key holds the rendered table rows
    data = requests.get(new_url, headers=headers).json()
    return BeautifulSoup(data['html'], 'lxml')

soup = get_soup(1)

rows = []
for company, joined, location, market, website, company_size, stage, raised in zip(soup.select('.column.company'),
                            soup.select('.column.joined .value'),
                            soup.select('.column.location .value'),
                            soup.select('.column.market .value'),
                            soup.select('.column.website .value'),
                            soup.select('.column.company_size .value'),
                            soup.select('.column.stage .value'),
                            soup.select('.column.raised .value')):

    company = company.get_text(strip=True, separator=" ")
    joined = joined.get_text(strip=True)
    location = location.get_text(strip=True)
    market = market.get_text(strip=True)
    website = website.get_text(strip=True)
    company_size = company_size.get_text(strip=True)
    stage = stage.get_text(strip=True)
    raised = raised.get_text(strip=True)

    rows.append([company, joined, location, market, website, company_size, stage, raised])

from textwrap import shorten
print(''.join('{: <25}'.format(shorten(d, 25)) for d in ['Company', 'Joined', 'Location', 'Market', 'Website', 'Company Size', 'Stage', 'Raised']))
print('-' * (25*8))
for row in rows:
    print(''.join('{: <25}'.format(shorten(d, 25)) for d in row))

Prints:

Company                  Joined                   Location                 Market                   Website                  Company Size             Stage                    Raised                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nutanix Your [...]       May ’14                  San Jose                 Virtualization           nutanix.com              1001-5000                IPO                      $312,200,000             
EverFi                   Oct ’12                  Washington DC            Education                everfi.com               51-200                   Series C                 $61,000,000              
Butter Make friends [...]Jun ’14                  San Francisco            Messaging                getbutter.me             1-10                     Seed                     $371,500                 
Fluent The future [...]  Mar ’12                  Sydney                   Curated Web              fluent.io                -                        -                        -                        
Belly                    Sep ’12                  Chicago                  Small and Medium [...]   bellycard.com                                     Series B                 $24,975,000              
Autotech Ventures [...]  Apr ’14                  Menlo Park               Internet of Things       autotechvc.com           1-10                     -                        -                        
Oscar Health [...]       Jun ’14                  Tempe                    Technology               hioscar.com              1001-5000                                         $1,267,500,000           
Tovala Smart oven [...]  Feb ’16                  Chicago                  Home Automation          tovala.com               11-50                    Series A                 $10,800,000              
GiftRocket Online [...]  Mar ’16                  San Francisco            Gift Card                giftrocket.com           1-10                     Seed                     $520,000                 
Elemeno Health B2B [...] Apr ’16                  Oakland                  Training                 elemenohealth.com        1-10                     Seed                     $1,635,000               
Sudo Technologies [...]  Apr ’16                  Menlo Park               -                        sudo.ai                                           -                        -                        
Stypi                    Sep ’16                  -                        -                                                                          Acquired                 -                        
Amazon Alexa Amazon [...]Sep ’16                  Cambridge                Speech Recognition       developer.amazon.com     11-50                    -                        -                        
Altos Ventures A [...]   Oct ’16                  Menlo Park               Technology               altos.vc                 1-10                     -                        -                        
Flirtey Making [...]     Oct ’16                  Reno                     -                        flirtey.com              11-50                    Series A                 $16,000,000              
SV Liquidity Fund [...]  Oct ’16                  San Francisco            B2B                      svlq.io                  1-10                     -                        -                        
Princeton Ventures [...] Jan ’17                  Princeton                Technology               princetonventures.com    1-10                     -                        -                        
hulu - Beijing [...]     Jan ’17                  Beijing                  TV Production            hulu.com                 -                        -                        -                        
Distributed Systems [...]Jan ’17                  San Francisco            Identity                 pavlov.ai                1-10                     -                        -                        
Fetch Marketplace [...]  May ’17                  Atlanta                  Technology               fetchtruck.com           1-10                     Seed                     -                        
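
As an aside, the manual `new_url` concatenation can be done with `urllib.parse.urlencode`, which also handles escaping. A sketch using a made-up `data` dict in the same shape as the first JSON response:

```python
from urllib.parse import urlencode

# Hypothetical sample in the shape of the first JSON response
data = {'ids': [101, 102], 'sort': 'signal', 'total': 5000, 'page': 2,
        'new': False, 'hexdigest': 'abc123'}

# One ('ids[]', id) pair per company id, then the remaining fields in order
params = [('ids[]', _id) for _id in data['ids']]
params += [('sort', data['sort']), ('total', data['total']),
           ('page', data['page']), ('new', str(data['new']).lower()),
           ('hexdigest', data['hexdigest'])]

new_url = 'https://angel.co/companies/startups?' + urlencode(params)
print(new_url)
```

Note that urlencode percent-escapes the square brackets (`ids%5B%5D=`), which servers treat the same as the literal `ids[]=`.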

EDIT: For getting just links, you can do:

soup = get_soup(1)

for a in soup.select('.website a[href]'):
    print(a['href'])

Prints:

http://www.fuelpowered.com
http://www.slide.com
http://www.mparticle.com
http://www.matter.io
http://www.smartling.com
https://stensul.com
https://avametric.com/
https://ledgerinvesting.com

http://www.relativityspace.com
http://teamdom.co
http://www.wonderschool.com
http://www.upcall.com
http://focal.systems
https://asktetra.com
https://www.subdreamstudios.com/
http://www.stedi.com
http://www.magnarapp.com/
http://www.kylie.ai
http://clipboardhealth.com
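
Note the blank line in that output: companies without a website yield an empty href. A small post-processing sketch (the sample list is made up) that drops empties and normalizes trailing slashes:

```python
def clean_links(links):
    # Drop empty/whitespace-only hrefs and strip trailing slashes
    return [link.rstrip('/') for link in links if link.strip()]

sample = ['http://www.fuelpowered.com', '', 'https://avametric.com/']
print(clean_links(sample))  # ['http://www.fuelpowered.com', 'https://avametric.com']
```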
Andrej Kesely
  • Hi Andrej, I tried your script just now but encountered the same error as I was having earlier, which is `raise JSONDecodeError("Expecting value", s, err.value) from None :json.decoder.JSONDecodeError: Expecting ` pointing at this line `data = requests.get(url, headers=headers, data=payload).json()`. I ran it as it is. Btw, did you intentionally use `data=payload` within a GET request? Thanks. – MITHU Jul 30 '19 at 17:54
  • @MITHU Running the script is working for me here...The site is using Cloudflare CDN, so maybe you are blocked on IP level. Can you try the script from different IP? Yes, you can use `data=` even on GET requests. – Andrej Kesely Jul 30 '19 at 17:57
  • I'll surely try once I will get chance to activate my vpn. Wasn't I in the right track as well with my initial attempt? Thanks. – MITHU Jul 30 '19 at 18:01
  • As your answers always lead me in the right direction, I pressed that checkmark in advance. If I encounter any issues, I'll let you know. Thanks. – MITHU Jul 30 '19 at 18:15
  • Hi Andrej, I tried your script activating vpn, using different working ip's but still I encounter the same error as I've stated in my first comment. Thanks. – MITHU Aug 04 '19 at 11:50
  • @MITHU Strange, I tried the script and it works here. Try to change and/or add more items to `headers` dictionary. When you open Firefox/Chrome tab and do request, you will see what the browser sends. https://imgur.com/pokXyJR (see right, down). Maybe adding more headers help - also change `User-Agent` etc. – Andrej Kesely Aug 04 '19 at 12:28

You can use selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup

d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://angel.co/companies')
# Each company card is a div.photo whose <a> points to the company page
links = [i.a['href'] for i in soup(d.page_source, 'html.parser').find_all('div', {'class':'photo'})]
d.quit()

Output:

['https://angel.co/company/orchestra-one', 'https://angel.co/company/workramp', 'https://angel.co/company/alien-labs', 'https://angel.co/company/teamdom', 'https://angel.co/company/focal-systems', 'https://angel.co/company/ripple-co', 'https://angel.co/company/solugen', 'https://angel.co/company/govpredict', 'https://angel.co/company/ring-6', 'https://angel.co/company/radiopublic', 'https://angel.co/company/function-of-beauty', 'https://angel.co/company/kid-koderz-city', 'https://angel.co/company/united-income', 'https://angel.co/company/volara', 'https://angel.co/company/optimus-ride', 'https://angel.co/company/amplitude-analytics', 'https://angel.co/company/nanonets', 'https://angel.co/company/magnar', 'https://angel.co/company/kylieai', 'https://angel.co/company/clipboardhealth']
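
The extraction step itself doesn't need a live browser; it can be checked against any saved `page_source` with the same `div.photo` markup (the sample HTML below is made up):

```python
from bs4 import BeautifulSoup as soup

# Minimal stand-in for d.page_source with the same structure
html = '''
<div class="photo"><a href="https://angel.co/company/orchestra-one"></a></div>
<div class="photo"><a href="https://angel.co/company/workramp"></a></div>
'''
# Same list comprehension as the selenium answer above
links = [i.a['href'] for i in soup(html, 'html.parser').find_all('div', {'class': 'photo'})]
print(links)
```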
Ajax1234