First of all, is it possible to do such a thing?
I have been trying to generate an XPath expression from a "sub-element text value" present in a webpage, using the lxml (etree, html, getpath) and ElementTree modules in Python. But I don't know how to generate an XPath expression for a value present in the webpage. I do know about the Scrapy framework in Python, but this is different.
Below is my incomplete code:
import urllib2, re
from lxml import etree

def wgetUrl(target):
    # fetch the page, sending a browser-like User-Agent header
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''
    return outtxt

newUrl = 'http://www.iupui.edu/~webtrain/tutorials/tables.html'  # homepage
dt = wgetUrl(newUrl)

# parse the downloaded HTML into an lxml element tree
parser = etree.HTMLParser()
tree = etree.fromstring(dt, parser)
In the lxml documentation the element tree is built by hand, but how can I use my downloaded and parsed HTML (in my example, the variable tree, or the raw HTML string dt) to access a sub-element, and more importantly, its text value?
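For example, I can imagine walking the parsed tree with something like the following (just a rough sketch of what I mean), but that only prints the text values, it doesn't tell me how to get an XPath expression for them:

    # rough sketch: walk every element in the parsed tree and print its tag and text
    for element in tree.iter():
        print element.tag, element.text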
Let's say, in the example webpage above, I want to search for the table heading "Supplies and Expenses" and dynamically generate an XPath expression from that value, Supplies and Expenses.
Is there any option to do so? The ultimate goal I would like to achieve is to read a webpage and generate the XPath for a sub-element text value present in that webpage.
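Is there something along these lines? This is only what I am hoping exists (the exact text match and the calls here are my guesses, not code I know to be correct):

    # hoped-for sketch: find elements by their text and ask lxml for an absolute XPath to them
    root_tree = tree.getroottree()
    for el in tree.xpath('//*[text()="Supplies and Expenses"]'):
        print root_tree.getpath(el)  # should print an absolute XPath to each matching element

I am not sure whether the text has to match exactly (maybe contains() or normalize-space() is needed), or whether getpath() is even the right function for this.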