Grabbing content from inside an HTML page using Python

Question

The Chinese website here describes a single company. Since many pages have similar content, I decided to learn web crawling in Python.

Basic code

import requests
from bs4 import BeautifulSoup
page = requests.get('http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356')

soup = BeautifulSoup(page.text, 'html.parser')
source_content = soup.find(class_='rightSide').find(class_='content register').find(class_='formestyle')

The information I want to collect

The figure was captured in Chrome's Inspect Element panel.

Since the Chinese text may be hard to follow, I created an example here for better illustration.

<th> the variable name </th> => For example, "company name", "company location"
<td> the target data I want to save </td>

My question

With my basic code, source_content contains no information. The output looked like this:

Comparing figures 1 and 2, we can see that the longitude and latitude information is gone.

How can I get this data with Python? Any advice would be appreciated.

score 1 · Accepted Answer · answered Jun 05 '18 at 15:03

The information can be obtained if you provide a Referer header in your request as follows:

import requests
from bs4 import BeautifulSoup

url = 'http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356'
page = requests.get(url, headers={'Referer' : url})
soup = BeautifulSoup(page.text, 'html.parser')

table = soup.find(class_='formestyle')

for tr in table.find_all('tr'):
    row = [v.text for v in tr.find_all(['th', 'td'])]
    print(row)

This would display the following type of data:

['地理坐标：', '经度：104.2153 \xa0\xa0纬度：31.3631']

As you can see, the longitude (经度) and latitude (纬度) are now present in the geographic coordinates (地理坐标) row.
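If you want the coordinates as numbers rather than one string padded with non-breaking spaces (\xa0), a small post-processing step could look like the sketch below. It uses only the standard library and works on the row printed above. The helper names clean and parse_coordinates are my own, and the regular expressions assume the 经度/纬度 labels are always followed by a colon and a decimal number, which I have only seen on this one page.

```python
import re

# Row exactly as printed by the loop above: [label cell, value cell].
row = ['地理坐标：', '经度：104.2153 \xa0\xa0纬度：31.3631']


def clean(text):
    """Collapse runs of whitespace, including non-breaking spaces (\xa0)."""
    return ' '.join(text.split())


def parse_coordinates(value):
    """Return (longitude, latitude) as floats, or None if either is missing.

    Assumes the page labels them 经度 (longitude) and 纬度 (latitude).
    """
    number = r'[：:]\s*(-?\d+(?:\.\d+)?)'
    lon = re.search('经度' + number, value)
    lat = re.search('纬度' + number, value)
    if lon is None or lat is None:
        return None
    return float(lon.group(1)), float(lat.group(1))


# Strip the trailing colon from the label cell, then parse the value cell.
label = clean(row[0]).rstrip('：:')
coords = parse_coordinates(row[1])
print(label, coords)
```

Returning None for a row without both labels lets you run the same function over every row of the table and keep only the one that matches.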
