
I'm trying to loop over the pages, crawl them, and save the detailed contents from this link:

Based on the code from here, I've modified it to:

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"

def get_main_urls() -> list:
    start_url = f"{BASE_URL}/index.html"
    return [start_url] + [f"{BASE_URL}/index_{i}.html" for i in range(1, 6)]


def get_follow_urls(urls: list, session: requests.Session()) -> iter:
    for url in urls[:1]:  # remove [:1] to scrape all the pages
        body = session.get(url).content
        s = BeautifulSoup(body, "lxml").find_all("td", {"width": "60%"})
        yield from [f"{BASE_URL}{a.find('a')['href'][1:]}" for a in s]

updated_df = pd.DataFrame()

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        # print(f"Fetching data for {key}...")
        dfs = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4")
        # https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe
        for df in dfs:
            # df = dfs[0].T
            df = dfs[0].T.iloc[1:, :].copy()
            updated_df = updated_df.append(df)
            print(updated_df)
    
    cols = ['项目编号', '转让/出租标的名称', '转让方/出租方名称', '转让标的评估价/年租金评估价(元)', 
            '转让底价/年租金底价(元)', '受让方/承租方名称', '成交价/成交年租金(元)', '成交日期']
    updated_df.columns = cols
    updated_df.to_excel('./data.xlsx', index = False)

But it only successfully crawls the first page. How could I crawl all the pages and also add a URL column? Thanks.

ah bon
  • make an initial request with requests and bs4 and extract `.pagenav_span font font` to get the number of pages, then loop, generating the new url by inserting the current page number? `f'http://www.jscq.com.cn/dsf/zc/cjgg/index_{page_number}.html'` – QHarr Mar 19 '21 at 07:21
  • Sorry, I didn't understand your question, could you pls paste your code? – ah bon Mar 19 '21 at 08:01
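
A rough sketch of the approach QHarr describes in the comment above (the `.pagenav_span font font` selector and reading the page count from it are assumptions taken from that comment, not verified against the live page):

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"

with requests.Session() as session:
    soup = BeautifulSoup(session.get(f"{BASE_URL}/index.html").content, "lxml")
    # Assumed: the pager element holds the total number of listing pages.
    node = soup.select_one(".pagenav_span font font")
    total_pages = int(node.get_text(strip=True)) if node else 1
    # The first page is index.html, the rest are index_1.html, index_2.html, ...
    urls = [f"{BASE_URL}/index.html"] + [
        f"{BASE_URL}/index_{i}.html" for i in range(1, total_pages)
    ]
    print(urls)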

1 Answer


Is this what you're looking for? This processes all the URLs and dumps a list of dataframes to a single Excel file.

Here's how:

from typing import Iterator

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"
COLUMNS = [
    '项目编号', '转让/出租标的名称', '转让方/出租方名称',
    '转让标的评估价/年租金评估价(元)', '转让底价/年租金底价(元)',
    '受让方/承租方名称', '成交价/成交年租金(元)', '成交日期', 'URL'
]


def get_main_urls() -> list:
    # The listing is paginated as index.html, index_1.html, ..., index_5.html.
    start_url = f"{BASE_URL}/index.html"
    return [start_url] + [f"{BASE_URL}/index_{i}.html" for i in range(1, 6)]


def get_follow_urls(urls: list, session: requests.Session) -> Iterator[str]:
    # Visit every listing page and yield the absolute URL of each detail page.
    for url in urls:
        body = session.get(url).content
        s = BeautifulSoup(body, "lxml").find_all("td", {"width": "60%"})
        yield from [f"{BASE_URL}{a.find('a')['href'][1:]}" for a in s]


def post_process(list_of_dataframes: list, source_url: str) -> pd.DataFrame:
    # Each detail page holds a single two-column table: field names in the
    # first column, values in the second. Keep that first table only.
    temp_df = list_of_dataframes[0]
    # Add the source URL as one more field/value row...
    temp_df = temp_df.append(
        pd.Series(["URL", source_url], index=temp_df.columns),
        ignore_index=True,
    )
    # ...then transpose so the fields become columns and drop the label row,
    # leaving a single row of values per scraped page.
    return temp_df.T.iloc[1:, :].copy()


def dump_to_excel(post_processed_dfs: list):
    # Stack the one-row frames, label the columns and write a single Excel file.
    df = pd.concat(post_processed_dfs)
    df.columns = COLUMNS
    df.to_excel("scraped_data.xlsx", index=False)


processed_dfs = []
with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        print(f"Fetching data for {key}...")
        df_list = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        processed_dfs.append(post_process(df_list, follow_url))
    dump_to_excel(processed_dfs)
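
To make the post_process() step easier to follow: each detail page contains one two-column table (field names on the left, values on the right). The function appends the URL as one more field/value pair, transposes the table so the fields become columns, and drops the label row. A toy illustration with made-up values and an example.com URL (pd.concat is used here since DataFrame.append has since been removed from pandas):

import pandas as pd

# Toy two-column table shaped like what pd.read_html() returns for one detail page.
raw = pd.DataFrame([["项目编号", "ABC-001"], ["成交日期", "2021-03-11"]])

# Add the source URL as one more field/value row.
raw = pd.concat(
    [raw, pd.DataFrame([["URL", "http://example.com/item.html"]])],
    ignore_index=True,
)

# Transpose and drop the label row: one wide row of values per scraped page.
wide = raw.T.iloc[1:, :].copy()
print(wide)  # ABC-001  2021-03-11  http://example.com/item.html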

Output:

(screenshot of the scraped data written to scraped_data.xlsx)

baduker
  • Yes, exactly, could we add a `url` column to save the link for each item? For example: http://www.jscq.com.cn/dsf/zc/cjgg/202103/t20210311_30347.html, etc. – ah bon Mar 19 '21 at 08:44
  • Sure, that can be done. What should the column be named in the target language (Chinese, right?)? – baduker Mar 19 '21 at 08:46
  • Btw, it seems not easy to understand your code, could you pls add explanations? – ah bon Mar 19 '21 at 08:46
  • Using `url_link` as column name is ok. – ah bon Mar 19 '21 at 08:47
  • @ahbon Check the updated answer. I'll add comments later. Meanwhile, study the code and try to figure it out on your own. :) – baduker Mar 19 '21 at 09:17
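
On the `url_link` column name discussed above: in the answer's code the URL ends up as the last column, so (as far as I can tell) it should just be a matter of changing the last entry of COLUMNS, e.g.:

# Same column list as in the answer, with the last entry renamed to 'url_link'.
COLUMNS = [
    '项目编号', '转让/出租标的名称', '转让方/出租方名称',
    '转让标的评估价/年租金评估价(元)', '转让底价/年租金底价(元)',
    '受让方/承租方名称', '成交价/成交年租金(元)', '成交日期', 'url_link'
]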