How can I scrape the prices of a fund from this page:

http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U

My attempt below does not work. How should I modify it?

import pandas as pd
import requests
import re
url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
tables = pd.read_html(requests.get(url).text, attrs={"class":re.compile("fundPriceCell\d+")})
Terence Ng
  • This is quite a messy html, I think you're going to need to explore the xml tree to grab the correct values. The attr classes should be on the table rather than the cells (I think)... – Andy Hayden Nov 14 '13 at 19:04
  • I'm sorry. Does that mean I have to import BeautifulSoup4? Any recommendation? – Terence Ng Nov 15 '13 at 03:21
  • Disclaimer: I could be wrong, and there could be a neat way to get read_html to grab this. If not, I was envisioning something like this: http://stackoverflow.com/a/16993660/1240268, but it's a bit messy/awkward. – Andy Hayden Nov 15 '13 at 03:54

2 Answers

I like lxml for parsing and querying HTML. Here's what I came up with:

import requests
from lxml import etree

url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
doc = requests.get(url)
tree = etree.HTML(doc.content)

row_xpath = '//tr[contains(td[1]/@class, "fundPriceCell")]'

rows = tree.xpath(row_xpath)

for row in rows:
    # Each matching <tr> is expected to hold exactly three <td> cells.
    date_string, v1, v2 = (td.text for td in row)
    print("%s - %s - %s" % (date_string, v1, v2))
brechin
  • What is "td" in the code? – Egret Sep 20 '21 at 08:40
  • @Egret - In this example, a generator expression iterates over the children of each table row (`<tr>`), which we expect to be table data (`<td>`) elements. – brechin Sep 22 '21 at 18:59
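
To make the role of `td` concrete, here is a minimal offline sketch of the same unpacking idea. It is not the code from the answer: it uses the standard-library `xml.etree.ElementTree` instead of lxml so that it runs without third-party packages, and the `SAMPLE` markup, its values, and the helper name `extract_rows` are made up for illustration. It also assumes the snippet is well-formed, which the real page may not be.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample that mimics the assumed row layout.
# The dates and prices are invented for illustration only.
SAMPLE = """
<table>
  <tr><td class="header">Date</td><td class="header">Bid</td><td class="header">Offer</td></tr>
  <tr><td class="fundPriceCell1">14/11/2013</td><td class="fundPriceCell1">1.234</td><td class="fundPriceCell1">1.299</td></tr>
  <tr><td class="fundPriceCell2">13/11/2013</td><td class="fundPriceCell2">1.230</td><td class="fundPriceCell2">1.295</td></tr>
</table>
"""


def extract_rows(markup):
    """Return (date, value1, value2) text tuples for the price rows."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.iter("tr"):
        cells = list(tr)  # direct children of the <tr>, i.e. its <td> elements
        # Keep only rows whose first cell carries a fundPriceCell* class.
        if not cells or "fundPriceCell" not in cells[0].get("class", ""):
            continue
        # 'td' is just the loop variable naming each child element in turn.
        date_string, v1, v2 = (td.text for td in cells)
        rows.append((date_string, v1, v2))
    return rows


for row in extract_rows(SAMPLE):
    print("%s - %s - %s" % row)
```

The unpacking raises a `ValueError` if a matching row does not have exactly three cells, which is a useful early signal that the page layout differs from what the code expects.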
My solution is similar to yours:

import pandas as pd
import requests
from lxml import etree

url = "http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U"
r = requests.get(url)
html = etree.HTML(r.content)
data = html.xpath('//table//table//table//table//td[@class="fundPriceCell1" or @class="fundPriceCell2"]//text()')

if len(data) % 3 == 0:
    df = pd.DataFrame([data[i:i+3] for i in range(0, len(data), 3)], columns = ['date', 'bid', 'ask'])
    df = df.set_index('date')
    df.index = pd.to_datetime(df.index, format = '%d/%m/%Y')
    df.sort_index(inplace = True)
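
The regrouping step can be tried without the network or pandas. The following is a rough sketch of the same idea in plain Python, assuming the XPath returns a flat list of strings in date, bid, ask order; the helper name `to_records` and the sample values are hypothetical. Unlike the code above, which silently skips the conversion when the length is not a multiple of three, this sketch raises an error.

```python
from datetime import datetime


def to_records(flat):
    """Group a flat [date, bid, ask, date, bid, ask, ...] list into sorted rows."""
    if len(flat) % 3 != 0:
        raise ValueError("expected a multiple of 3 values, got %d" % len(flat))
    # Slice the flat list into consecutive chunks of three items.
    rows = [flat[i:i + 3] for i in range(0, len(flat), 3)]
    records = [
        (datetime.strptime(d, "%d/%m/%Y").date(), float(bid), float(ask))
        for d, bid, ask in rows
    ]
    # Tuples sort by their first element, so this orders the rows by date.
    return sorted(records)


# Invented sample values, in the order the page is assumed to list them.
sample = ["14/11/2013", "1.234", "1.299", "13/11/2013", "1.230", "1.295"]
for record in to_records(sample):
    print(record)
```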
Terence Ng