Say I need to crawl the detailed contents from this link:
The objective is to extract the contents of the elements from the link and append all the entries to a dataframe.
from bs4 import BeautifulSoup
import requests

url = 'http://www.jscq.com.cn/dsf/zc/cjgg/202101/t20210126_30144.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# grab every text node, then keep only the ones whose parent tag is real content
text = soup.find_all(text=True)
output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)
Out:
南京市玄武区锁金村10-30号房屋公开招租成交公告-成交公告-江苏产权市场
body{font-size:100%!important;}
.main_body{position:relative;width:1000px;margin:0 auto;background-color:#fff;}
.main_content_p img{max-width:90%;display:block;margin:0 auto;}
.m_con_r_h{padding-left: 20px;width: 958px;height: 54px;line-height: 55px;font-size: 12px;color: #979797;}
.m_con_r_h a{color: #979797;}
.main_content_p{min-height:200px;width:90%;margin:0 auto;line-height: 30px;text-indent:0;}
.main_content_p table{margin:0 auto!important;width:900px!important;}
.main_content_h1{border:none;width:93%;margin:0 auto;}
.tit_h{font-size:22px;font-family:'微软雅黑';color:#000;line-height:30px;margin-bottom:10px;padding-bottom:20px;text-align:center;}
.doc_time{font-size:12px;color:#555050;height:28px;line-height:28px;text-align:center;background:#F2F7FD;border-top:1px solid #dadada;}
.doc_time span{padding:0 5px;}
.up_dw{width:100%;border-top:1px solid #ccc;padding-top:10px;padding-bottom:10px;margin-top:30px;clear:both;}
.pager{width:50%;float:left;padding-left:0;text-align:center;}
.bshare-custom{position:absolute;top:20px;right:40px;}
.pager{width:90%;padding-left: 50px;float:inherit;text-align: inherit;}
页头部分开始
页头部分结束
START body
南京市玄武区锁金村10-30号房屋公开招租成交公告
组织机构:江苏省产权交易所
发布时间:2021-01-26
项目编号
17FCZZ20200125
转让/出租标的名称
南京市玄武区锁金村10-30号房屋公开招租
转让方/出租方名称
南京邮电大学资产经营有限责任公司
转让标的评估价/年租金评估价(元)
64800.00
转让底价/年租金底价(元)
97200.00
受让方/承租方名称
马尕西木
成交价/成交年租金(元)
97200.00
成交日期
2021年01月15日
附件:
END body
页头部分开始
页头部分结束
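The CSS rules in the middle of that output are the text of the page's <style> tags. If only the visible text is wanted, one option is to add 'style' to the blacklist as well, e.g.:

blacklist = [
    '[document]', 'noscript', 'header', 'html',
    'meta', 'head', 'input', 'script',
    'style',  # drops the inline CSS shown above
]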
But how could I loop through all the pages, extract their contents, and append them to a single dataframe? Thanks.
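For context, the update below calls two helper functions, get_main_urls() and get_follow_urls(), whose definitions are not shown here. A rough sketch of what they might look like, assuming the listing index of http://www.jscq.com.cn/dsf/zc/cjgg/ is paginated as index.html, index_1.html, ... and that announcement links end in .html (both assumptions need checking against the real site):

from urllib.parse import urljoin

BASE = 'http://www.jscq.com.cn/dsf/zc/cjgg/'  # assumed listing section

def get_main_urls(pages=5):
    # assumed pagination pattern: index.html, index_1.html, index_2.html, ...
    urls = [urljoin(BASE, 'index.html')]
    urls += [urljoin(BASE, f'index_{i}.html') for i in range(1, pages)]
    return urls

def get_follow_urls(main_urls, session):
    # collect the detail-page links from every index page
    for main_url in main_urls:
        soup = BeautifulSoup(session.get(main_url).content, 'html.parser')
        for a in soup.select('a[href$=".html"]'):  # may need a tighter selector
            yield urljoin(main_url, a['href'])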
Update: appending the dfs as a single dataframe:
import pandas as pd

updated_df = pd.DataFrame()

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        # print(f"Fetching data for {key}...")
        dfs = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        # pd.read_html returns a list of dataframes, one per <table>:
        # https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe
        for df in dfs:
            # each table holds label/value pairs, so transpose it and drop the label row
            df = df.T.iloc[1:, :].copy()
            updated_df = updated_df.append(df)

print(updated_df)

cols = ['项目编号', '转让/出租标的名称', '转让方/出租方名称', '转让标的评估价/年租金评估价(元)',
        '转让底价/年租金底价(元)', '受让方/承租方名称', '成交价/成交年租金(元)', '成交日期']
updated_df.columns = cols
updated_df.to_excel('./data.xlsx', index=False)
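Note that DataFrame.append was removed in pandas 2.0, so with a recent pandas the same accumulation can be done by collecting the transposed tables in a list and concatenating once at the end, roughly like this:

frames = []
with requests.Session() as connection_session:
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        dfs = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        frames.extend(df.T.iloc[1:, :].copy() for df in dfs)

updated_df = pd.concat(frames, ignore_index=True)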