Python: Can't extract tbody information from website

Question

I want to extract all links of this website: https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#/tab/general

The information I want is stored in the tbody: page code

Every time I try to extract the data I get no result.

from bs4 import BeautifulSoup
import requests
from requests_html import HTMLSession

url = "https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#complex-searchresult"



session = HTMLSession()
r = session.get(url)
r.html.render()

soup = BeautifulSoup(r.html.html,'html.parser')

print(r.html.search("Details"))

Thank you for your help!

score 0 · Accepted Answer · answered Jan 20 '22 at 13:18

The site uses a backend API to deliver the data. If you open your browser's Developer Tools, go to Network, filter by Fetch/XHR, and refresh the page, you'll see the data load as JSON from a request with a URL similar to the one you posted.

You can request that data directly like this. It returns JSON, which is easy to parse:

import requests

# Headers copied from the browser request; the Referer is required by the API.
headers = {
    'Referer': 'https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}

# Fetch the first two pages, 20 results per page (offsets 0 and 20).
for page in range(2):
    url = f'https://pflegefinder.bkk-dachverband.de/api/nursing-homes?required=1&statistics=1&maxDistance=0&careType=inpatientCare&limit=20&offset={page * 20}'
    resp = requests.get(url, headers=headers).json()
    print(resp)

The API checks that the request has a "Referer" header; without it you get a 400 response.
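
Since the goal is to extract links, here is a minimal sketch of one way to pull them out of the parsed JSON. The structure of the API response isn't shown here, so the sketch makes no assumption about key names: it walks the whole object and collects every string value that looks like an absolute URL. The helper name `iter_links` and the `sample` data are made up for illustration and are not the real response.

```python
from typing import Any, Iterator


def iter_links(node: Any) -> Iterator[str]:
    """Yield every string value in parsed JSON that looks like an absolute URL."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_links(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_links(item)
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield node


# Made-up sample standing in for one page of the API response (assumption).
sample = {
    "results": [
        {"name": "Home A", "website": "https://example.org/a"},
        {"name": "Home B", "contact": {"website": "http://example.org/b"}},
    ],
    "total": 2,
}

links = list(iter_links(sample))
print(links)
```

In the loop above you would call `iter_links(resp)` on each page instead of printing the raw response. Once you have inspected the real JSON and know which key holds the link you want, indexing that key directly is likely cleaner than a generic walk.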
