
I use Python 3.5 and Windows 10.

When I crawl pages, I usually follow the URL changes using urlopen and a for loop, like the code below.

from bs4 import BeautifulSoup
import urllib.request  # in Python 3, urlopen lives in urllib.request

f = open('Slave.txt', 'w', encoding='utf-8')

for i in range(1, 42):
    html = urllib.request.urlopen('http://xroads.virginia.edu/~hyper/JACOBS/hjch' + str(i) + '.htm')
    soup = BeautifulSoup(html, "lxml")
    text = soup.get_text()
    f.write(text)  # the file is opened with UTF-8 encoding, so write str directly

f.close()

But now I am stuck, because the URL does not change: when I click to the next pages the web contents change, but the URL stays the same and shows no pattern.

There is no signal in the URL that I can use to follow the site's page changes.

http://eungdapso.seoul.go.kr/Shr/Shr01/Shr01_lis.jsp

The web site is here. The only clue I found was in the pagination class: there are some links to go to the next pages, but I don't know how to use these links with BeautifulSoup. I think commonPagingPost is a function defined by the site's developer.

<span class="number"><a href="javascript:;" 
class="on">1</a>&nbsp;&nbsp;
<a href="javascript:commonPagingPost('2','10','Shr01_lis.jsp');">2</a>&nbsp;&nbsp;
<a href="javascript:commonPagingPost('3','10','Shr01_lis.jsp');">3</a>&nbsp;&nbsp;
<a href="javascript:commonPagingPost('4','10','Shr01_lis.jsp');">4</a>&nbsp;&nbsp;
<a href="javascript:commonPagingPost('5','10','Shr01_lis.jsp');">5</a></span>

How can I open or crawl all these pages using BeautifulSoup 4? I only get the first page when I use urlopen.
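For reference, I can already pull the commonPagingPost arguments out of those pagination links with BeautifulSoup and a regular expression (this part works; it's only triggering the page change that I can't do):

```python
import re
from bs4 import BeautifulSoup

# The pagination snippet quoted above.
html = """<span class="number"><a href="javascript:;" class="on">1</a>
<a href="javascript:commonPagingPost('2','10','Shr01_lis.jsp');">2</a>
<a href="javascript:commonPagingPost('3','10','Shr01_lis.jsp');">3</a>
<a href="javascript:commonPagingPost('4','10','Shr01_lis.jsp');">4</a>
<a href="javascript:commonPagingPost('5','10','Shr01_lis.jsp');">5</a></span>"""

soup = BeautifulSoup(html, "html.parser")
pages = []
for a in soup.select("span.number a"):
    # The current page has href="javascript:;" and carries no arguments.
    m = re.search(r"commonPagingPost\('(\d+)','(\d+)','([^']+)'\)", a.get("href", ""))
    if m:
        pages.append((m.group(1), m.group(2), m.group(3)))

print(pages)
# → [('2', '10', 'Shr01_lis.jsp'), ('3', '10', 'Shr01_lis.jsp'),
#    ('4', '10', 'Shr01_lis.jsp'), ('5', '10', 'Shr01_lis.jsp')]
```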

김상엽
  • Use the Inspect Element tool in your web browser, then do some network captures as you manually click on the page numbers. It is likely that the pagination is handled by HTTP POST requests. If you can glean what the payload of the POST request is, it is likely that you will be able to craft your request headers so that you can move through the numbered pages. – dagrha Feb 25 '16 at 22:11

2 Answers


You won't be able to do this with BeautifulSoup alone, as it doesn't support AJAX. You'll need to use something like Selenium, Ghost.py, or another browser-automation tool with JavaScript support.

Using these libraries you'll be able to simulate a click on these links and then grab the newly loaded content.
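A minimal sketch of that approach with Selenium. This assumes a local Chrome driver is available, and it calls the site's own commonPagingPost function via execute_script rather than clicking, which amounts to the same form submission; the page range (2..5) is taken from the pagination links shown in the question:

```python
def scrape_all_pages(url="http://eungdapso.seoul.go.kr/Shr/Shr01/Shr01_lis.jsp"):
    """Drive a real browser through the pagination and collect each page's text.

    Sketch only: requires selenium and a matching browser driver, and
    the page range is assumed from the five links in the question.
    """
    from selenium import webdriver  # imported here so the sketch stays optional

    driver = webdriver.Chrome()
    driver.get(url)
    texts = [driver.find_element("tag name", "body").text]  # page 1

    # Pages 2..5 are reachable through the commonPagingPost links.
    for page in range(2, 6):
        driver.execute_script("commonPagingPost('%d','10','Shr01_lis.jsp');" % page)
        texts.append(driver.find_element("tag name", "body").text)

    driver.quit()
    return texts
```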

bmcculley

I searched the code for commonPagingPost and found this JavaScript function definition:

function commonPagingPost(Page, Block, Action) {
    var Frm = document.mainForm;
    Frm.RCEPT_NO.value = "";
    Frm.page.value = Page;
    Frm.action = Action;
    Frm.submit();
}

So it fills out "mainForm" and submits it. What does mainForm look like?

<form name="mainForm" method="post" action="">
                <input type="hidden" name="RCEPT_NO" value="">
                <input type="hidden" name="search_flag" value="N">
                <input type="hidden" name="page" value="1">
</form>
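Putting the two together: calling commonPagingPost('5','10','Shr01_lis.jsp') amounts to POSTing these form fields to Shr01_lis.jsp. A sketch of that mapping in Python:

```python
def paging_post_payload(page):
    """Mirror what commonPagingPost() puts into mainForm before submitting.

    Note: the Block argument is never stored in a form field, so it does
    not appear in the payload; search_flag keeps its default value "N".
    """
    return {
        "RCEPT_NO": "",      # cleared by the JS function
        "search_flag": "N",  # hidden field default
        "page": str(page),   # set to the requested page number
    }

print(paging_post_payload(5))
# → {'RCEPT_NO': '', 'search_flag': 'N', 'page': '5'}
```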

Okay, the function fills out a form and sets the target page to 'Shr01_lis.jsp', the same page you are trying to scrape. Can we do this in Python? Yes!

import requests
from bs4 import BeautifulSoup

r = requests.post(
    "http://eungdapso.seoul.go.kr/Shr/Shr01/Shr01_lis.jsp",
    data={
        "RCEPT_NO": "",
        "search_flag": "N",
        "page": "5"
    })

soup = BeautifulSoup(r.text, 'lxml')

I prefer requests over urllib, because requests is simpler to work with for POST requests.
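To collect every page, the same POST can be repeated in a loop. A sketch (the 1..5 page range is an assumption taken from the five pagination links in the question):

```python
import requests
from bs4 import BeautifulSoup

def fetch_page_text(page):
    """POST to Shr01_lis.jsp for the given page number and return its text."""
    r = requests.post(
        "http://eungdapso.seoul.go.kr/Shr/Shr01/Shr01_lis.jsp",
        data={"RCEPT_NO": "", "search_flag": "N", "page": str(page)})
    return BeautifulSoup(r.text, "lxml").get_text()

# Usage (requires network access):
# all_text = [fetch_page_text(p) for p in range(1, 6)]
```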

sahuk