First of all, is it possible to do such a thing?

I have been trying to generate an XPath expression from the text values of sub-elements in a web page, using lxml (etree, html, getpath) and the ElementTree module in Python. However, I don't know how to generate an XPath expression for a value that appears in the page. I am aware of the Scrapy framework, but this is a different problem.

Below is my incomplete code:

import urllib2, re
from lxml import etree

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except urllib2.URLError:
        # Network or HTTP failure: fall back to an empty document
        return ''
    return outtxt


newUrl = 'http://www.iupui.edu/~webtrain/tutorials/tables.html' # homepage

dt = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree   = etree.fromstring(dt, parser)

The lxml documentation builds the element tree manually, but how can I use the HTML I have already fetched and parsed (the variables `dt` and `tree` in my example) to access a sub-element or, more importantly, a sub-element's text value?

Say that in the example page above I want to find the table "Supplies and Expenses" and generate an XPath expression dynamically from that value.

Is there a way to do this? My ultimate goal is to read a web page and generate the XPath for a sub-element given its text value.

bosnjak
techDiscussion
  • Do you want to generate an XPath for elements with specific text, or a "general" XPath like `/html/body/div[6]/table/tbody/tr[1]/td/div/b` for given elements (which may have different text on another page)? – Binux Dec 16 '14 at 12:19
  • I want to generate an XPath for elements with specific text. If I've understood correctly, there is no need to worry about other pages. – techDiscussion Dec 16 '14 at 13:31
  • The answer below is exactly what you want: an XPath that points to the element with specific text. If you want an XPath like `/html/...`, first locate the element (with the XPath in the answer below, or by traversing the tree), then trace back to the root yourself. – Binux Dec 16 '14 at 14:24
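
The "trace back to the root yourself" idea from the last comment could look roughly like the sketch below. It assumes lxml on Python 3; `manual_path` is a hypothetical helper name, not part of lxml, and the tiny HTML string is made up for illustration. It walks from the element up through `getparent()` and adds a positional index only when a parent has several children with the same tag.

```python
from lxml import etree


def manual_path(elem):
    """Build an absolute XPath for elem by walking up to the root (hypothetical helper)."""
    steps = []
    node = elem
    while node is not None:
        parent = node.getparent()
        step = node.tag
        if parent is not None:
            # Siblings with the same tag need a 1-based positional index.
            same_tag = [sib for sib in parent if sib.tag == node.tag]
            if len(same_tag) > 1:
                step += "[%d]" % (same_tag.index(node) + 1)
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))


# Made-up document, parsed with lxml's lenient HTML parser.
doc = etree.HTML("<html><body><div><p>a</p><p>b</p></div></body></html>")
target = doc.xpath("//p")[1]  # second <p>
print(manual_path(target))
```

This is only meant to show the mechanics for plain HTML elements; namespaces, comments and processing instructions are not handled.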

1 Answer

To find all elements based on a part of their text value:

"//*[contains(text(), 'some_value')]"

For example, if you have this:

<div id="somediv">
    <span>Something is here</span>
    <a href="#">Click here</a>
</div>

You can find all sub-elements containing the word "here" like this:

"//div[@id='somediv']//*[contains(text(), 'here')]"

Or, for example, you can find all span elements inside the div that contain the word "Something":

"//div[@id='somediv']//span[contains(text(), 'Something')]"
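
As a quick check, the two expressions above can be run against the sample markup with lxml. This is a minimal sketch assuming Python 3; the variable names (`snippet`, `any_with_here`, `spans_with_something`) are illustrative only.

```python
from lxml import etree

snippet = """
<div id="somediv">
    <span>Something is here</span>
    <a href="#">Click here</a>
</div>
"""

# etree.HTML() uses the lenient HTML parser and wraps the fragment in <html><body>.
root = etree.HTML(snippet)

# Every element under the div whose text contains "here".
any_with_here = root.xpath("//div[@id='somediv']//*[contains(text(), 'here')]")

# Only <span> elements under the div whose text contains "Something".
spans_with_something = root.xpath(
    "//div[@id='somediv']//span[contains(text(), 'Something')]"
)

for elem in any_with_here:
    print(elem.tag, repr(elem.text))
```

Note that `xpath()` always returns a list here, even when only one element matches.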

As for parsing this in lxml:

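from lxml import etree
# outtxt holds the raw HTML, e.g. the string returned by wgetUrl() in the question
root = etree.fromstring(outtxt, etree.HTMLParser())
matches = root.xpath("my_xpath_expression")  # returns a list of matching elements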

Update:

To get the full XPath expression for an element, use the ElementTree.getpath() method, like so:

tree = etree.ElementTree(root)
# this will print XPath of all
# elements in 'root'
for e in root.iter():
    print(tree.getpath(e))
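
To get the path only for elements that contain a given text, the two steps can be combined. The sketch below assumes Python 3; `xpaths_for_text` is a hypothetical helper name and the small page is made up for illustration, not the page from the question. Since `xpath()` returns a list of elements, `getpath()` has to be called on each item rather than on the list itself.

```python
from lxml import etree


def xpaths_for_text(root, needle):
    """Return the absolute XPath of every element whose text contains needle."""
    tree = root.getroottree()
    # $needle is an XPath variable, so quotes inside the text need no escaping.
    matches = root.xpath("//*[contains(text(), $needle)]", needle=needle)
    # xpath() returns a list; getpath() expects a single element.
    return [tree.getpath(elem) for elem in matches]


# Made-up page for illustration.
page = """<html><body>
<h1>Budget</h1>
<table>
<tr><th>Supplies and Expenses</th></tr>
<tr><td>Paper</td></tr>
</table>
</body></html>"""

root = etree.HTML(page)
paths = xpaths_for_text(root, "Supplies and Expenses")
for path in paths:
    print(path)
```

Each returned path can be fed back into `root.xpath(path)` to reach the same element again.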
bosnjak
  • That's a nice answer, but my question is: in your example above, starting from _"here"_, can I generate an XPath expression for it all the way back to the head of the HTML? Something like back-tracing the elements and generating the XPath expression, for example the XPath for _here_: `/html/body/div[6]/table/tbody/tr[1]/td/div/b/text()`. I hope I made myself clear. – techDiscussion Dec 16 '14 at 14:01
  • Yes, I understand. Check my answer again, I updated. – bosnjak Dec 16 '14 at 16:33
  • I followed your update, but I am getting the error `TypeError: Argument 'element' has incorrect type (expected lxml.etree._Element, got list)`. I updated my code as you suggested; it is the same up to `root` and `tree`. Then I tried to select one sub-element whose text is not repeated anywhere in the page with `subElem = root.xpath('//*[contains(text(), "V A R I E T Y")]')` or `print tree.getpath(root.xpath('//*[contains(text(), "V A R I E T Y")]'))`, and I get this error. Does `[contains(text()..` return a **list** object instead of the **ElementTree** object that I want to _re-iterate_? – techDiscussion Dec 17 '14 at 09:09
  • Frankly, **+1** for your input. It's working and I can generate XPaths for all the elements in the page. **Much appreciated**. The issue for me was always generating the XPath for a _specific sub-element_ (using the text value of that element). – techDiscussion Dec 17 '14 at 09:14
  • I'm not sure if this answers your question, or there is still something left unanswered? – bosnjak Dec 17 '14 at 12:23