
Say I need to crawl the detailed contents from this link:


The objective is to extract the contents of the elements from the link and append all the entries to a dataframe.

from bs4 import BeautifulSoup
import requests

url = 'http://www.jscq.com.cn/dsf/zc/cjgg/202101/t20210126_30144.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
text = soup.find_all(text=True)

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)

Out:

南京市玄武区锁金村10-30号房屋公开招租成交公告-成交公告-江苏产权市场 
body{font-size:100%!important;}
.main_body{position:relative;width:1000px;margin:0 auto;background-color:#fff;}
.main_content_p img{max-width:90%;display:block;margin:0 auto;}
.m_con_r_h{padding-left: 20px;width: 958px;height: 54px;line-height: 55px;font-size: 12px;color: #979797;}
.m_con_r_h a{color: #979797;}
.main_content_p{min-height:200px;width:90%;margin:0 auto;line-height: 30px;text-indent:0;}
.main_content_p table{margin:0 auto!important;width:900px!important;}
.main_content_h1{border:none;width:93%;margin:0 auto;}
.tit_h{font-size:22px;font-family:'微软雅黑';color:#000;line-height:30px;margin-bottom:10px;padding-bottom:20px;text-align:center;}
.doc_time{font-size:12px;color:#555050;height:28px;line-height:28px;text-align:center;background:#F2F7FD;border-top:1px solid #dadada;}
.doc_time span{padding:0 5px;}
.up_dw{width:100%;border-top:1px solid #ccc;padding-top:10px;padding-bottom:10px;margin-top:30px;clear:both;}
.pager{width:50%;float:left;padding-left:0;text-align:center;}

.bshare-custom{position:absolute;top:20px;right:40px;}
.pager{width:90%;padding-left: 50px;float:inherit;text-align: inherit;}
 页头部分开始 
 页头部分结束 
  START body  
 南京市玄武区锁金村10-30号房屋公开招租成交公告 
 组织机构:江苏省产权交易所 
 发布时间:2021-01-26  
 项目编号 
 17FCZZ20200125 
 转让/出租标的名称 
 南京市玄武区锁金村10-30号房屋公开招租 
 转让方/出租方名称 
 南京邮电大学资产经营有限责任公司 
 转让标的评估价/年租金评估价(元) 
 64800.00 
 转让底价/年租金底价(元) 
 97200.00 
 受让方/承租方名称 
 马尕西木 
 成交价/成交年租金(元) 
 97200.00 
 成交日期 
 2021年01月15日 
 附件: 
  END body  
 页头部分开始 
 页头部分结束 
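As an aside, the CSS rules appear in the output above because `style` is not in the blacklist. A minimal, self-contained sketch of the fix (using an inline HTML snippet in place of the live page):

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for the live page: a <style> block plus body text.
html = """
<html><head><style>body{font-size:100%;}</style></head>
<body><p>南京市玄武区锁金村10-30号房屋公开招租成交公告</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

blacklist = [
    '[document]', 'noscript', 'header', 'html',
    'meta', 'head', 'input', 'script',
    'style',  # <- added: drops text nodes that live inside <style> tags
]

output = ''
for t in soup.find_all(text=True):
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)  # the CSS rule is gone; only the visible text remains
```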

But how can I loop over all the pages, extract the contents, and append them to the following dataframe? Thanks.

(screenshots of the paged listings and the desired dataframe omitted)

Update: appending the dfs to a single dataframe:

updated_df = pd.DataFrame()

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        # print(f"Fetching data for {key}...")
        dfs = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        # https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe
        for df in dfs:
            df = dfs[0].T.iloc[1:, :].copy()
            updated_df = updated_df.append(df)
            print(updated_df)

cols = ['项目编号', '转让/出租标的名称', '转让方/出租方名称', '转让标的评估价/年租金评估价(元)', 
        '转让底价/年租金底价(元)', '受让方/承租方名称', '成交价/成交年租金(元)', '成交日期']
updated_df.columns = cols
updated_df.to_excel('./data.xlsx', index = False)
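For reference, here is a self-contained sketch of the corrected append logic. The `fake_tables` helper is a hypothetical stand-in for the list that `pd.read_html` returns for one follow page, so the fix can be shown without hitting the network; the key changes are iterating over `df` instead of reusing `dfs[0]`, and concatenating once after the loop:

```python
import pandas as pd

def fake_tables(project_id):
    # Hypothetical stand-in for pd.read_html(...): one key/value table
    # per page, shaped like the announcement tables (labels in column 0).
    return [pd.DataFrame({0: ['项目编号', '成交日期'],
                          1: [project_id, '2021年01月15日']})]

frames = []
for key in ['t20210126_30144', 't20210209_30231']:
    dfs = fake_tables(key)
    for df in dfs:
        # fix: use the loop variable, not dfs[0]; transpose so the
        # labels become row 0, then drop that row to keep the values
        frames.append(df.T.iloc[1:, :])

# concatenate once, outside the loop, instead of repeated .append()
updated_df = pd.concat(frames, ignore_index=True)
updated_df.columns = ['项目编号', '成交日期']  # must match the column count
print(updated_df)
```

With the real pages, `fake_tables(key)` would be replaced by the `pd.read_html(...)` call above, and the eight labels from `cols` would be assigned instead of the two shown here.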
ah bon

1 Answer


Here's how I would do this:

  1. build all main urls
  2. visit every main page
  3. get the follow urls
  4. visit each follow url
  5. grab the table from the follow url
  6. parse the table with pandas
  7. add the table to a dictionary of pandas dataframes
  8. process the tables (not included -> implement your logic)

Repeat steps 2-7 to continue scraping the data.

The code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"


def get_main_urls() -> list:
    start_url = f"{BASE_URL}/index.html"
    return [start_url] + [f"{BASE_URL}/index_{i}.html" for i in range(1, 6)]


def get_follow_urls(urls: list, session: requests.Session) -> iter:
    for url in urls[:1]:  # remove [:1] to scrape all the pages
        body = session.get(url).content
        s = BeautifulSoup(body, "lxml").find_all("td", {"width": "60%"})
        yield from [f"{BASE_URL}{a.find('a')['href'][1:]}" for a in s]


dataframe_collection = {}

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        print(f"Fetching data for {key}...")
        df = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        dataframe_collection[key] = df

    # process the dataframe_collection here

# print the dictionary of dataframes (optional and can be removed)
for key in dataframe_collection.keys():
    print("\n" + "=" * 40)
    print(key)
    print("-" * 40)
    print(dataframe_collection[key])
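To take step 8 a bit further, one possible way to flatten `dataframe_collection` into a single dataframe (a sketch, assuming each value is the list of key/value tables that `pd.read_html` returns; `combine` and the demo data are hypothetical):

```python
import pandas as pd

def combine(dataframe_collection: dict) -> pd.DataFrame:
    rows = []
    for key, dfs in dataframe_collection.items():
        for df in dfs:
            row = df.T.iloc[1:, :].copy()  # keep the value row only
            row.columns = df.iloc[:, 0]    # label columns with the keys
            rows.append(row)
    return pd.concat(rows, ignore_index=True)

# demo with a hypothetical one-page collection (no network needed)
demo = {'t20210126_30144': [pd.DataFrame({0: ['项目编号'],
                                          1: ['17FCZZ20200125']})]}
print(combine(demo))
```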

Output:

Fetching data for t20210311_30347...
Fetching data for t20210311_30346...
Fetching data for t20210305_30338...
Fetching data for t20210305_30337...
Fetching data for t20210303_30323...
Fetching data for t20210225_30306...
Fetching data for t20210225_30305...
Fetching data for t20210225_30304...
Fetching data for t20210225_30303...
Fetching data for t20210209_30231...

and then ...

(screenshot of the printed dataframes omitted)

baduker
    FYI it's __scraping__ not scrapping – DisappointedByUnaccountableMod Mar 17 '21 at 13:25
  • Good catch, I've fixed the typo. :] – baduker Mar 17 '21 at 13:26
  • Many thanks, I updated the code in the OP to append the output as a dataframe, but it raises an error: `ValueError: Length mismatch`, please check. – ah bon Mar 17 '21 at 15:37
  • 1
    @ahbon you have `8` rows in the table but you're adding `10` columns. Hence, the mismatch. – baduker Mar 17 '21 at 15:42
  • Sorry my mistake, one more question: what does `->` represent in Python? I haven't found any tutorial regarding it. – ah bon Mar 18 '21 at 12:26
  • 1
    Take a look at [->](https://docs.python.org/3/library/typing.html). – baduker Mar 18 '21 at 12:28
  • One more question, with code I updated, it only saves first page's contents to `data.xlsx`. Could you please test? – ah bon Mar 18 '21 at 12:45
  • 1
    @ahbon move the df processing out of the scraping `for loop`. In other words, unindent it. – baduker Mar 18 '21 at 12:54
  • 1
    Also check this line `dfs[0].T.iloc[1:, :].copy()` because you keep modifying the first df only. Finally, this is way too much work to get this done via the comments. I'd suggest asking a new question. – baduker Mar 18 '21 at 13:49
  • I asked a new question: https://stackoverflow.com/questions/66702997/only-crawler-the-first-page-and-save-detailed-contents-as-dataframe-in-python. Please check. – ah bon Mar 19 '21 at 05:36