How to use Pandas read_html and requests library to read the table?

Question

How can I scrape the prices of a fund from this page:

http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U

My attempt below does not work. How should I modify it?

import pandas as pd
import requests
import re
url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
tables = pd.read_html(requests.get(url).text, attrs={"class":re.compile(r"fundPriceCell\d+")})

This is quite messy HTML. I think you're going to need to explore the XML tree to grab the correct values. The classes passed in attrs should be on the table rather than the cells (I think)... — Andy Hayden, Nov 14 '13 at 19:04
I'm sorry. Does that mean I have to import BeautifulSoup4? Any recommendation? — Terence Ng, Nov 15 '13 at 03:21
Disclaimer: I could be wrong, and there could be a neat way to get read_html to grab this. If not, I was envisioning something like this: http://stackoverflow.com/a/16993660/1240268, but it's a bit messy/awkward. — Andy Hayden, Nov 15 '13 at 03:54
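
A minimal sketch of the point raised in the comments above: read_html matches attrs against the <table> element, not against its cells. The HTML snippet and the class name "prices" are invented for illustration. The real page may not put any usable attribute on its price table, in which case this approach would not apply and walking the tree directly is the fallback.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup: a layout table plus a price table that carries its own class.
html = """
<table class="layout">
  <tr><td>Navigation</td><td>Unrelated cell</td></tr>
</table>
<table class="prices">
  <thead>
    <tr><th>Date</th><th>Bid</th><th>Offer</th></tr>
  </thead>
  <tbody>
    <tr><td class="fundPriceCell1">01/11/2013</td><td class="fundPriceCell1">10.50</td><td class="fundPriceCell1">11.05</td></tr>
    <tr><td class="fundPriceCell2">04/11/2013</td><td class="fundPriceCell2">10.55</td><td class="fundPriceCell2">11.10</td></tr>
  </tbody>
</table>
"""

# attrs filters on attributes of <table>, so only the second table is returned.
tables = pd.read_html(StringIO(html), attrs={"class": "prices"})
df = tables[0]
print(df)
```

A class that appears only on the td elements, such as fundPriceCell1, cannot be used to select the table this way.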

score 2 · Accepted Answer · answered Dec 06 '13 at 17:00

I like lxml for parsing and querying HTML. Here's what I came up with:

import requests
from lxml import etree

url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
doc = requests.get(url)
tree = etree.HTML(doc.content)

# Select every table row whose first cell has a class containing "fundPriceCell"
row_xpath = '//tr[contains(td[1]/@class, "fundPriceCell")]'

rows = tree.xpath(row_xpath)

for row in rows:
    # Each matching row is expected to hold exactly three <td> cells
    (date_string, v1, v2) = (td.text for td in row)
    print("%s - %s - %s" % (date_string, v1, v2))

brechin


What is "td" in the code? – Egret Sep 20 '21 at 08:40
@Egret - In this example, a generator expression is used to iterate over each row's (i.e. table row, `<tr>`) children, which we expect to be table data (i.e. `<td>`) elements. – brechin Sep 22 '21 at 18:59
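
The generator expression described in the comment above can be tried in isolation. This is only a sketch: it uses the standard library's xml.etree.ElementTree in place of lxml (both let you iterate over an element to get its children), and the row markup and values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical table row; the values are invented for illustration.
row = ET.fromstring(
    '<tr>'
    '<td class="fundPriceCell1">01/11/2013</td>'
    '<td class="fundPriceCell1">10.50</td>'
    '<td class="fundPriceCell1">11.05</td>'
    '</tr>'
)

# "td" is just the loop variable: it takes each child element of the row in turn.
# The generator yields three strings, which are unpacked into three names.
date_string, bid, offer = (td.text for td in row)
print("%s - %s - %s" % (date_string, bid, offer))
```

If a row held more or fewer than three cells, the unpacking would raise a ValueError, which is why the XPath in the answer restricts the match to the price rows.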

score 1 · Answer 2 · answered Dec 13 '13 at 02:58

My solution is similar to yours:

import pandas as pd
import requests
from lxml import etree

url = "http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U"
r = requests.get(url)
html = etree.HTML(r.content)
# Collect the text of every price cell, in document order
data = html.xpath('//table//table//table//table//td[@class="fundPriceCell1" or @class="fundPriceCell2"]//text()')

# Expect (date, bid, ask) triples; build the frame only if the cells group evenly
if len(data) % 3 == 0:
    df = pd.DataFrame([data[i:i+3] for i in range(0, len(data), 3)], columns=['date', 'bid', 'ask'])
    df = df.set_index('date')
    df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
    df.sort_index(inplace=True)
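
The grouping step in the answer above (turning a flat list of cell texts into rows of three) can be checked without pandas or a network call. The sketch below assumes the cells arrive in date, bid, ask order. The sample values and the helper name chunk are hypothetical.

```python
from datetime import datetime

# Hypothetical flat list, shaped like what the text() query might return.
data = ["04/11/2013", "10.55", "11.10", "01/11/2013", "10.50", "11.05"]


def chunk(values, size):
    """Split values into consecutive lists of length size."""
    if len(values) % size != 0:
        raise ValueError("expected a multiple of %d cells, got %d" % (size, len(values)))
    return [values[i:i + size] for i in range(0, len(values), size)]


rows = chunk(data, 3)

# Parse day/month/year dates and numeric prices, then sort by date ascending.
records = sorted(
    (datetime.strptime(date, "%d/%m/%Y"), float(bid), float(ask))
    for date, bid, ask in rows
)

for when, bid, ask in records:
    print(when.date(), bid, ask)
```

Raising on an uneven cell count, rather than skipping silently, makes it easier to notice when the page layout has changed.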
