Web scraping a text() in python

Question

I am having trouble with a web scraping function. The XPaths for the two things I am trying to get are

/html/body/div/table[2]/tbody/tr[5]/td[1]/div[1]/ul/li[1]/text()
/html/body/div/table[2]/tbody/tr[5]/td[1]/div[1]/ul/li[1]/a

The html is

<li><a href="http://www.acu.edu/" target="_blank" class="institution">Abilene Christian University</a> (TX)</li>

I am trying to write a function that loops through each li in tr[5]. The problem I am having is getting the text(). I have tried a number of different variations of this function:

from lxml.html import parse
from urllib2 import urlopen
def _clean(lst):
    for elm in lst:
        lnk=elm.findall('.//a')
        for this in lnk:
            lnk_txt.append(this.text_content())
        state_txt.append(elm.findall('.//text()'))

This specific function raises a KeyError on the '()'. If I remove the (), it returns a list of empty elements. The lnk_txt part works.

What I am trying to get are two lists. One is the name of the university. The other is the location of the university. The ultimate goal is to make tuples (name, state).

It is the (TX). I added the sample and my packages to the post — lost, Sep 18 '15 at 14:52

score 2 · Accepted Answer · answered Sep 18 '15 at 15:05

You need the text node that follows the a element, that is, its following text sibling. With lnk being a single a element (not the list returned by findall):

lnk.xpath("following-sibling::text()")

Demo:

>>> import lxml.html
>>> data = '<li><a href="http://www.acu.edu/" target="_blank" class="institution">Abilene Christian University</a> (TX)</li>'
>>> li = lxml.html.fromstring(data)
>>> li.xpath("//a[@class='institution']/following-sibling::text()")[0].strip()
'(TX)'
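
To connect this back to the goal in the question, here is a minimal sketch of a loop that builds the (name, state) tuples. It uses xpath() rather than findall(), because findall() only accepts the limited ElementPath subset, which has no text() node test, while xpath() evaluates full XPath 1.0. The SAMPLE markup, its second list item, and the extract_pairs name are assumptions made for illustration; they are not taken from the real page.

```python
import lxml.html

# Hypothetical sample markup modelled on the <li> from the question;
# the second entry is invented purely for illustration.
SAMPLE = """
<ul>
  <li><a href="http://www.acu.edu/" target="_blank" class="institution">Abilene Christian University</a> (TX)</li>
  <li><a href="http://www.example.edu/" target="_blank" class="institution">Example State University</a> (CA)</li>
</ul>
"""


def extract_pairs(root):
    """Return (name, state) tuples, one per a.institution link under root."""
    pairs = []
    for link in root.xpath(".//a[@class='institution']"):
        name = link.text_content().strip()
        # Text nodes that follow the link inside the same parent element.
        trailing = link.xpath("following-sibling::text()")
        # Fall back to an empty string when no text follows the link.
        state = trailing[0].strip() if trailing else ""
        pairs.append((name, state))
    return pairs


root = lxml.html.fromstring(SAMPLE)
pairs = extract_pairs(root)
print(pairs)
```

In lxml the text directly after an element is also available as its tail attribute, so link.tail would likely work here as well. If the parentheses are not wanted, state.strip("()") drops them.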

alecxe
Thank you, it worked. Is there a resource you used for the answer, or did you know it from experience? – lost Sep 18 '15 at 15:11

@lost I would say this is a specific skill "locating elements in the html". Study xpath syntax, css selectors - there is a lot of information out there on the web. But, I would say, practice and practice more. – alecxe Sep 18 '15 at 15:21