Scraping -- Text element missing for <dt> tag from JS generated page using PyQt4

Question

I'm trying to scrape this page using PyQt4 but for some reason the text elements for <dt> tags are not showing up when I search using BeautifulSoup.

I'm pretty new to using PyQt4, so I'm not sure what's going wrong here. I get the text for the <dd> tags but nothing for the <dt> tags. Is the page not fully loaded, or is something else going wrong? Any help is appreciated.

Here's the code I've been using so far:

class Client(QWebPage):
    def __init__(self, url):
         print('\n\nLoading: \n', url)
         self.app = QApplication(sys.argv)
         QWebPage.__init__(self)
         self.loadFinished.connect(self.on_page_load)
         self.mainFrame().load((QUrl(url)))
         self.app.exec()
         self.app.quit()

    def on_page_load(self):
        self.app.quit()

url = 'http://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en'

client_response = Client(url)

source = client_response.mainFrame().toHtml()
soup = bs.BeautifulSoup(source, 'lxml')


table = soup.find('div', {'class' : 'left_list_leve quote'})

price =  soup.find('span' , {'class' : 'col_last'})
name = soup.find('p' , {'class' : 'col_name'})
all_dls = table.findAll('dl')

This is the result I get after running the script.

   Loading:
 http://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en
[<dl>
<dd class="ico_name label_prevcls">PREV. CLOSE*</dd>
<dt class="ico_data col_prevcls"></dt>
</dl>, <dl>
<dd class="ico_name label_open">OPEN**</dd>
<dt class="ico_data col_open"></dt>
</dl>, <dl>
<dd class="ico_name label_turnover">TURNOVER</dd>
<dt class="ico_data col_turnover"></dt>
</dl>, <dl>
<dd class="ico_name label_volume">VOLUME</dd>
<dt class="ico_data col_volume"></dt>
</dl>, <dl>
<dd class="ico_name label_mktcap">MKT CAP</dd>
<dt class="ico_data col_mktcap"></dt>
</dl>, <dl>
<dd class="ico_name label_lotsize">LOT SIZE</dd>
<dt class="ico_data col_lotsize"></dt>
</dl>, <dl>
<dd class="ico_name label_bid">BID</dd>
<dt class="ico_data col_bid"></dt>
</dl>, <dl>
<dd class="ico_name label_ask">ASK</dd>
<dt class="ico_data col_ask"></dt>
</dl>, <dl>
<dd class="ico_name label_eps">EPS</dd>
<dt class="ico_data col_eps"></dt>
</dl>, <dl>
<dd class="ico_name label_pe">P/E</dd>
<dt class="ico_data col_pe"></dt>
</dl>, <dl>
<dd class="ico_name label_divyield">DIV YIELD</dd>
<dt class="ico_data col_divyield"></dt>
</dl>]
<span class="col_last"></span>
<p class="col_name"></p>

This is dynamic content generated by JavaScript. You need a tool like Selenium to get the page content after the JavaScript has executed, or you can make a direct request to [this URL](https://www1.hkex.com.hk/hkexwidget/data/getequityquote?sym=700&token=evLtsLsBNAUVTPxtGqVeG8qfBLy+gjSsWl061OKuZ31G2i0I1fpRSa2hG7MJvctU&lang=eng&qid=1517742861135&callback=jQuery311006082724835119191_1517742853554&_=1517742853555) using, for example, a python requests session -- Andersson, Feb 04 '18 at 11:21
@Andersson Selenium is not the only way "to get page content after JavaScript executed"; PyQt can do it, as can DryScrape. See my answer to https://stackoverflow.com/questions/45259232/scraping-google-finance-beautifulsoup/45259523#45259523 -- Dan-Dev, Feb 04 '18 at 14:02
@Andersson The OP asked for a PyQt solution, or at least said that he was using it. You said "need a tool like Selenium" but he was using PyQt, which is misleading if you ask me. -- Dan-Dev, Feb 04 '18 at 14:08
@Dan-Dev, I don't deny this fact. I also didn't provide **any** solution, just advice in comments. I'm glad you got a solution in PyQt and I hope it'll be a useful one -- Andersson, Feb 04 '18 at 14:15
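
For anyone who wants to try the direct-request route from the first comment: the thread doesn't show what that endpoint returns, but the `callback=` parameter in the URL suggests a JSONP wrapper around JSON. That is an assumption, and the sample payload below (including the `data`/`quote`/`ls` keys) is made up for illustration. A minimal sketch of unwrapping such a response once you have its text:

```python
import json
import re


def strip_jsonp(text):
    """Return the parsed JSON payload of a JSONP response like 'cb({...});'."""
    # Callback name, opening parenthesis, payload, closing parenthesis,
    # then an optional trailing semicolon.
    match = re.match(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))


# Hypothetical response text; the real endpoint's fields are not shown in the thread.
sample = 'jQuery311_1517742853554({"data": {"quote": {"ls": "452.400"}}});'
payload = strip_jsonp(sample)
print(payload["data"]["quote"]["ls"])
```

The `token` and `qid` values in the URL look session-specific, so a copied URL may stop working; that part is not covered here.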

Dan-Dev · Accepted Answer · 2018-02-04T14:15:55.747

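You're missing a _loadFinished() method. In the version below it stores the main frame once loading has finished, and the HTML is read from that stored frame afterwards.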

# -*- coding: utf-8 -*-
# Python 2 / PyQt4 script (unicode() below is Python 2 only).
import sys

import bs4 as bs
from PyQt4 import QtCore, QtWebKit
from PyQt4.QtGui import QApplication


class Client(QtWebKit.QWebPage):
    def __init__(self, url):
        # A QApplication must exist before the QWebPage is initialised.
        self.app = QApplication(sys.argv)
        QtWebKit.QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QtCore.QUrl(url))
        # Run the Qt event loop until _loadFinished() calls quit().
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep a reference to the loaded frame so its HTML can be read later.
        self.frame = self.mainFrame()
        self.app.quit()


url = 'http://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en'

client_response = Client(url)

# Convert the rendered page to UTF-8 bytes before handing it to BeautifulSoup.
source = client_response.frame.toHtml()
u = unicode(source).encode("utf-8", errors="replace")
soup = bs.BeautifulSoup(u, 'lxml')

table = soup.find('div', {'class': 'left_list_leve quote'})

price = soup.find('span', {'class': 'col_last'})
name = soup.find('p', {'class': 'col_name'})
all_dls = table.findAll('dl')

for dl in all_dls:
    print(dl)

Outputs:

<dl>
<dd class="ico_name label_prevcls">PREV. CLOSE*</dd>
<dt class="ico_data col_prevcls">HK$460.000</dt>
</dl>
<dl>
<dd class="ico_name label_open">OPEN**</dd>
<dt class="ico_data col_open">HK$459.000</dt>
</dl>
<dl>
<dd class="ico_name label_turnover">TURNOVER</dd>
<dt class="ico_data col_turnover">HK$11.08B</dt>
</dl>
<dl>
<dd class="ico_name label_volume">VOLUME</dd>
<dt class="ico_data col_volume">24.33M</dt>
</dl>
<dl>
<dd class="ico_name label_mktcap">MKT CAP</dd>
<dt class="ico_data col_mktcap">HK$4,297.37B</dt>
</dl>
<dl>
<dd class="ico_name label_lotsize">LOT SIZE</dd>
<dt class="ico_data col_lotsize">100</dt>
</dl>
<dl>
<dd class="ico_name label_bid">BID</dd>
<dt class="ico_data col_bid">HK$452.400</dt>
</dl>
<dl>
<dd class="ico_name label_ask">ASK</dd>
<dt class="ico_data col_ask">HK$452.600</dt>
</dl>
<dl>
<dd class="ico_name label_eps">EPS</dd>
<dt class="ico_data col_eps">RMB4.383</dt>
</dl>
<dl>
<dd class="ico_name label_pe">P/E</dd>
<dt class="ico_data col_pe">91.55x</dt>
</dl>
<dl>
<dd class="ico_name label_divyield">DIV YIELD</dd>
<dt class="ico_data col_divyield">0.13%</dt>
</dl>
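
Once the <dt> values are populated as in the output above, the label/value pairs can be collected into a dictionary. The following is only a sketch: `dl_to_dict` is a hypothetical helper name, the HTML is a shortened copy of the output above, and it uses the built-in 'html.parser' backend rather than lxml so the snippet has fewer dependencies. It also lists any labels whose value is still empty, which is the symptom described in the question.

```python
import bs4 as bs

# Shortened copy of the markup printed above.
html = """
<div class="left_list_leve quote">
<dl>
<dd class="ico_name label_prevcls">PREV. CLOSE*</dd>
<dt class="ico_data col_prevcls">HK$460.000</dt>
</dl>
<dl>
<dd class="ico_name label_volume">VOLUME</dd>
<dt class="ico_data col_volume">24.33M</dt>
</dl>
</div>
"""


def dl_to_dict(table):
    """Map each <dd> label to the text of the <dt> in the same <dl>."""
    result = {}
    for dl in table.find_all('dl'):
        label = dl.find('dd')
        value = dl.find('dt')
        if label is None or value is None:
            continue  # skip <dl> blocks that lack either tag
        result[label.get_text(strip=True)] = value.get_text(strip=True)
    return result


soup = bs.BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'left_list_leve quote'})
quote = dl_to_dict(table)

# Labels whose <dt> is still empty, i.e. not yet filled in by the page's JavaScript.
missing = [label for label, value in quote.items() if not value]

print(quote)
print(missing)
```

If `missing` is non-empty, the HTML was most likely captured before the values were written into the page.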

score 0 · Answer 2 · answered Feb 04 '18 at 13:45

Try using selenium:

from selenium import webdriver

# PhantomJS support is deprecated in Selenium (see the warning in the sample output).
driver = webdriver.PhantomJS()
driver.get('http://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en')

content = driver.find_element_by_xpath('//*[@class="left_list_item list_item_op"]')
print(content.text)

driver.quit()  # shut down the PhantomJS process

Sample Output:

SIVABALANs-MBP:Desktop siva$ python test_phantomjs.py 
/Users/siva/anaconda3/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
PREV. CLOSE*
HK$460.000
OPEN**
HK$459.000
TURNOVER
HK$11.08B
VOLUME
24.33M
MKT CAP
HK$4,297.37B
LOT SIZE
100
SIVABALANs-MBP:Desktop siva$
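
The text printed above alternates between a label line and a value line, so it can be paired up with plain Python. A small sketch, assuming that alternation holds for the element you select; `pair_label_values` is a hypothetical helper name, and the sample text is copied from the output above:

```python
def pair_label_values(text):
    """Pair alternating label/value lines into a dict, keeping their order."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2:
        raise ValueError('expected an even number of non-empty lines')
    # Even positions are labels, odd positions are the matching values.
    return dict(zip(lines[0::2], lines[1::2]))


# Stand-in for content.text from the Selenium script.
sample_text = """PREV. CLOSE*
HK$460.000
OPEN**
HK$459.000
TURNOVER
HK$11.08B
VOLUME
24.33M
MKT CAP
HK$4,297.37B
LOT SIZE
100
"""

quote = pair_label_values(sample_text)
print(quote['VOLUME'])
```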
