
I want to crawl the link: http://data.eastmoney.com/hsgt/index.html

But I found that none of the XHR responses contain the data; it seems to arrive through an EventStream instead. How can I crawl the complete information on the page?

For example, I want to crawl -94.67亿元 on the page.

My code is below:

import requests
import pandas as pd
from pyquery import PyQuery
from lxml import etree
import time

response = requests.get(url='http://data.eastmoney.com/hsgt/index.html',
                        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'})
response.encoding = 'GB2312'

# this shows False
'-94.67' in response.text

I then tried to install dryscrape, but the installation failed with a message saying I have no web server file.

Many thanks for the help.

Wei Zhang

1 Answer


As you mention, the XHR requests, which are managed by the JavaScript running in the client, aren't being executed. This is because the requests package doesn't execute JavaScript and isn't trying to mimic a web browser, so the HTML it returns is the page before any script has filled in the numbers. You should look into an alternative approach. You have many options, and I'd suggest reading up on scraping JavaScript-rendered pages for more context on the problem.
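One option that avoids a browser altogether is to read the event stream yourself. The following is only a minimal sketch, assuming the page's data arrives as a standard text/event-stream response; the function names iter_sse_data and stream_events are made up for illustration, and the stream URL is not known here, so you would have to copy it from the EventStream entry in your browser's network tab.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event from text/event-stream lines."""
    buffer = []
    for line in lines:
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and ":" comments are ignored.
    # An event that was not closed by a blank line is dropped.


def stream_events(url: str, headers: dict) -> Iterator[str]:
    """Hypothetical helper: open `url` with requests and yield event payloads."""
    # Imported here so the parser above can be used without requests installed.
    import requests

    with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        # Set an encoding so iter_lines() yields str; UTF-8 is an assumption.
        resp.encoding = "utf-8"
        yield from iter_sse_data(resp.iter_lines(decode_unicode=True))
```

Each yielded payload is just a string; if the site sends JSON, you would pass it to json.loads and pick out the field you need. Whether the value you are after is actually present in that stream is something to verify in the browser first.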

Additionally, maybe look at something like dryscrape. I haven't used it myself, but it seems like something akin to this

import dryscrape

sess = dryscrape.Session()
sess.visit('http://data.eastmoney.com/hsgt/index.html')
source = sess.body()

is what you are after. Have fun.

JustDanyul