1

I select an HTML tag with XPath, using some conditions, and then get its value with text(). Is there any way to get attributes out of that text() value?

The value returned by text():

document.write("<a href="http://www...">hello</a>"); 

So far I get the whole line, which is fine. Now I want to get the /@href from that value.

Here is my code:

code = "...<script>document.write("<a href="http://www...">hello</a>"); </script>..."

doc = lxml.html.fromstring(code)
value = doc.xpath( "//script[contains(text(), 'document.write') and (contains(text(),'href'))]//text()" )

I could try a regex, but maybe there is a better way to solve this with XPath.

Thanks

alecxe
user3507915
  • 2
    Is that your actual code? `document.write("hello");` looks like a syntax error to me. – Kevin Dec 29 '14 at 12:55
  • Nope, I cut out some irrelevant parts – user3507915 Dec 29 '14 at 13:30
  • 1
    I'm not referring to the "..." in your url, if that's what you mean. I'm referring to your use of double quotes inside a double quoted string, which is illegal syntax. Look at your code samples in this question. See how most of the lines are red? That's because the code formatting tool thinks that they're string literals. – Kevin Dec 29 '14 at 13:39

2 Answers

3

You can avoid using regex by calling LH.fromstring on the text inside the <script> tag:

import lxml.html as LH
code = '...<script>document.write("<a href="http://www...">hello</a>"); </script>...'

doc = LH.fromstring(code)
for text in doc.xpath( "//script[contains(text(), 'document.write') and (contains(text(),'href'))]//text()" ):
    script = LH.fromstring(text)
    print(script.xpath('//a/@href'))

yields

['http://www...']
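
If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name hrefs_from_script_writes and the variable sample are made up for illustration, and it assumes the markup inside document.write is written literally (unescaped), as in the question. It has not been checked against scripts that escape the inner quotes.

```python
import lxml.html as LH


def hrefs_from_script_writes(page_source):
    """Collect href values of <a> tags embedded in document.write() calls.

    Hypothetical helper, not part of lxml.
    """
    doc = LH.fromstring(page_source)
    hrefs = []
    # Same condition as above: scripts whose text mentions both
    # 'document.write' and 'href'.
    script_texts = doc.xpath(
        "//script[contains(text(), 'document.write')"
        " and contains(text(), 'href')]/text()"
    )
    for text in script_texts:
        # Re-parse the script text as HTML so the embedded <a> becomes an element.
        fragment = LH.fromstring(text)
        hrefs.extend(fragment.xpath('//a/@href'))
    return hrefs


# Sample input modelled on the snippet from the question.
sample = '<div><script>document.write("<a href="http://www...">hello</a>"); </script></div>'
print(hrefs_from_script_writes(sample))
```

Scripts that do not match the condition are simply skipped, so a page without any such document.write call gives an empty list.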
unutbu
-1

Follow these steps to get the href value of the "a" tag inside the "script" tags:

  1. Get the text of each "script" tag by iterating with the iter method.
  2. Parse that text again as HTML to create script_root.
  3. Find the href attribute of each "a" tag in script_root, again with the iter method.

from lxml import html

# Single-quoted literal, so the double quotes inside the markup need no escaping.
code = '<script>document.write("<a href="http://www...">hello</a>"); </script>'
root = html.fromstring(code)
for i in root.iter("script"):
    if not i.text:
        continue  # skip empty or external scripts
    # Re-parse the script text so the embedded <a> becomes an element.
    script_root = html.fromstring(i.text)
    for j in script_root.iter("a"):
        href = j.get("href")
        if href is not None:
            print("href:-", href)
Vivek Sable