
(Python 3.4.2) First off, I'm pretty new to python--more than a beginner but less than an intermediate user.

I'm trying to display the literal text of the URLs on a page by using lxml. I think I've ALMOST got it, but I'm missing something. I can get the actual link URLs, but not their titles.

Example--from this,

<a class="yt-uix-sessionlink yt-uix-tile-link  spf-link  yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" aria-describedby="description-id-588180" data-sessionlink="ei=6t2FVJLtEsOWrAbQ24HYAg&amp;ved=CAcQvxs&amp;feature=c4-videos-u" href="/watch?v=I2AcJG4112A&amp;list=UUrtZO4nmCBN4C9ySmi013oA">Zombie on Omegle!</a>

I want to get this:

'Zombie on Omegle!'

(I'll make that HTML tag a little more readable for you guys.)

<a class="yt-uix-sessionlink yt-uix-tile-link  spf-link  yt-ui-ellipsis yt-ui-ellipsis-2"
   dir="ltr" aria-describedby="description-id-588180"
   data-sessionlink="ei=6t2FVJLtEsOWrAbQ24HYAg&amp;ved=CAcQvxs&amp;feature=c4-videos-u"
   href="/watch?v=I2AcJG4112A&amp;list=UUrtZO4nmCBN4C9ySmi013oA">
       Zombie on Omegle!
</a>

I'm trying to do this from a YouTube page, and one of the problems is that YouTube doesn't specify a tag or an attribute for the titles of its links, if that makes sense.

Here's what I've tried:

import lxml.html
from lxml import etree
import urllib.request

url = 'https://www.youtube.com/user/makemebad35/videos'
response = urllib.request.urlopen(url)
content = response.read()
doc = lxml.html.fromstring(content)
tree = lxml.etree.HTML(content)
parser = etree.HTMLParser()

href_list = tree.xpath('//a/@href')
#Perfect. List of all URLs under the 'href' attribute.
href_res = [lxml.etree.tostring(href) for href in href_list]
#^TypeError: Type 'lxml.etree._ElementUnicodeResult' cannot be serialized.

#So I tried extracting the 'a' tag without the attribute 'href'.
a_list = tree.xpath('//a')
a_res = [lxml.etree.tostring(clas) for clas in a_list]
#^This works.

links_fail = lxml.html.find_rel_links(doc,'href')
#^I named it 'links_fail' because it doesn't work: the list is empty on output.
#   But the 'links_success' list below works.
urls = doc.xpath('//a/@href')
links_success = [link for link in urls if link.startswith('/watch')]
links_success
#^Out: ['/watch?v=K_yEaIBByFo&list=UUrtZO4nmCBN4C9ySmi013oA', ...]
#Awesome! List of all URLs that begin with '/watch?v=...'.
#Now if only I could get the titles of the links...

contents = [text.text_content() for text in urls if text.startswith('/watch')]
#^Empty list.

#I thought this paragraph below wouldn't work,
#   but I decided to try it anyway.
texts_fail = doc.xpath('//a/[@href="watch"]')
#^XPathEvalError: Invalid expression
#^Oops, I made a syntax error there. I forgot a '/' before 'watch'.
#    But after correcting it (below), the output is the same.
texts_fail = doc.xpath('//a/[@href="/watch"]')
#^XPathEvalError: Invalid expression
texts_false = doc.xpath('//a/@href="watch"')
texts_false
#^Out: False
#^Typo again. But again, the output is still the same.
texts_false = doc.xpath('//a/@href="/watch"')
texts_false
#^Out: False

target_tag = ''.join(('//a/@class=',
                        '"yt-uix-sessionlink yt-uix-tile-link  spf-link  ',
                        'yt-ui-ellipsis yt-ui-ellipsis-2"'))
texts_html = doc.xpath(target_tag)
#^Out: True
#But YouTube doesn't make attributes for link titles.
texts_tree = tree.xpath(target_tag)
#^Out: True

#I also tried this below, which I found in another stackoverflow question.
#It fails. The error is below.
doc_abs = doc.make_links_absolute(url)
#^Returns None (make_links_absolute modifies doc in place), which is why the loop below fails.
text = []
text_content = []
notText = []
hasText = []
for each in doc_abs.iter():
    if each.text:
        text.append(each.text)
        hasText.append(each)   # list of elements that has text each.text is true
    text_content.append(each.text_content()) #the text for all elements 
    if each not in hasText:
        notText.append(each)
#AttributeError                            Traceback (most recent call last)
#<ipython-input-215-38c68f560efe> in <module>()
#----> 1 for each in doc_abs.iter():
#      2     if each.text:
#      3         text.append(each.text)
#      4         hasText.append(each)   # list of elements that has text each.text is true
#      5     text_content.append(each.text_content()) #the text for all elements
#
#AttributeError: 'NoneType' object has no attribute 'iter'

I'm out of ideas. Anyone want to help this python padawan? :P

-----EDIT-----

I'm a step further, thanks to theSmallNothing. This command gets the text elements:

doc.xpath('//a/text()')

Unfortunately, that command returns a lot of whitespace and newlines ('\n') as values. I'll probably post another question later for that issue. If I do, I'll put a link to that question here in case anyone else with the same question ends up here.

How to use lxml to pair 'url links' with the 'names' of the links (eg. {name: link})
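
In the meantime, here's a rough, untested sketch of the {name: link} pairing I'm after (it reuses the doc from above; name_link is just a placeholder, and the stripping is only my guess at how to skip the whitespace-only entries):

name_link = {}
for a in doc.xpath('//a'):
    href = a.get('href')
    title = a.text_content().strip()
    # skip anchors without an href and anchors whose text is only whitespace
    if not href or not title:
        continue
    if href.startswith('/watch'):
        name_link[title] = href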

GreenRaccoon23

1 Answer


For your example, you want to use the text() selector in your XPath query:

doc.xpath('//a/text()')

which returns the text content of all the a elements it can find.

To get the href and text of all the a elements, which I think is what you're trying to do, you can first extract all the a elements, then iterate over them and extract the href and text of each one individually.

watch_els = []

els = doc.xpath('//a')
for el in els:
    # use relative paths so the query only looks inside this <a> element
    text = el.xpath("text()")
    href = el.xpath("@href")
    #check that the text and href lists are not empty...
    if not text or not href:
        #empty text/href, skip.
        continue

    text = text[0]
    href = href[0]
    if "/watch?" in href:
        #do something with a youtube video link...
        watch_els.append((text, href))
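
If you then want the {name: link} dict from your edit, something like this should do it (an untested sketch working from watch_els above; video_links is just a placeholder name). It strips each title and drops the whitespace-only ones:

video_links = {}
for text, href in watch_els:
    title = text.strip()
    if title:  # skip whitespace-only text nodes
        video_links[title] = href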
tsn
  • Yes! Thank you! :D I thought it was going to be something simple like that. Is there a way to push this command so that it only shows the text elements of tags whose href attribute begins with '/watch...'? I can't figure out lxml's regex syntax. – GreenRaccoon23 Dec 08 '14 at 18:38
  • No probs. It's not really regex; it's actually completely different, and you can look it up at http://www.w3schools.com/xpath/ . Though I should warn you, trying to select (extract) elements based on their contents (i.e. the start of the href attribute) is tricky; you probably want to just use the format I used above with the 'for' loop, except check the start of the url (href attribute) with Python's startswith string method: https://docs.python.org/2/library/stdtypes.html – tsn Dec 08 '14 at 18:46
  • Good idea. That sounds simple, but I can't get it to work. I'm getting a lot of values for doc.xpath('//a') that are just whitespace or '\n'. – GreenRaccoon23 Dec 08 '14 at 19:28
  • Here's what I've tried if you're curious. I'll probably post this as another question later though, cause it's a whole different issue. `import re texts = doc.xpath('//a/text()') texts_test = [] texts_test2 = [] for t in texts: texts_test.append(t.strip()) if re.findall('\S', t): texts_test2.append(t) urls = doc.xpath('//a/@href') len(texts_test) #263 len(texts_test2) #44 len(urls) #109` – GreenRaccoon23 Dec 08 '14 at 19:37
  • Yikes! Haha the newlines didn't work. Yeah, I'll definitely post this as another question later. I might as well post what else I've tried though. `from collections import defaultdict links_dic_pre = dict(zip(urls, str(texts))) links_dic = defaultdict() for key, value in links_dic_pre.items(): if key.startswith('https://www.youtube.com/watch'): links_dic[key] = value` – GreenRaccoon23 Dec 08 '14 at 19:39
  • Haha, idk what you're doing; I've updated the answer to show what I mean, but like you said it's probably better in a different question. Oh, and if you're satisfied with my answer please accept it; I like rep and it means the question isn't auto-deleted in a month or so. – tsn Dec 08 '14 at 20:20
  • I think there is an error in the code: in both `el.xpath` rows `//` must be removed to work as intended – Sga May 31 '18 at 11:05