Extract text() and get attributes from it

Question

I select an HTML tag with XPath, using some conditions, and get its value with text(). Is there any way to get attributes from this value (the result of text())?

Value from text()

document.write("<a href="http://www...">hello</a>");

Now I get the whole line (that's OK so far), and I want to get the /@href from that value.

Here is my code:

code = "...<script>document.write("<a href="http://www...">hello</a>"); </script>..."

doc = lxml.html.fromstring(code)
value = doc.xpath( "//script[contains(text(), 'document.write') and (contains(text(),'href'))]//text()" )

I could try it with a regex, but maybe there is a better way to solve my problem with XPath.

Thanks

Is that your actual code? `document.write("hello");` looks like a syntax error to me. — Kevin, Dec 29 '14 at 12:55
I'm not referring to the "..." in your url, if that's what you mean. I'm referring to your use of double quotes inside a double quoted string, which is illegal syntax. Look at your code samples in this question. See how most of the lines are red? That's because the code formatting tool thinks that they're string literals. — Kevin, Dec 29 '14 at 13:39

score 3 · Accepted Answer · answered Dec 29 '14 at 13:01

You can avoid using regex by calling LH.fromstring on the text inside the <script> tag:

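import lxml.html as LH

# Single quotes outside, so the double quotes in the JavaScript need no escaping.
code = '...<script>document.write("<a href="http://www...">hello</a>"); </script>...'

doc = LH.fromstring(code)
# Select the text of every <script> that mentions both document.write and href.
for text in doc.xpath("//script[contains(text(), 'document.write') and (contains(text(),'href'))]//text()"):
    # Re-parse that text as HTML so the embedded <a> becomes a real element.
    script = LH.fromstring(text)
    print(script.xpath('//a/@href'))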

yields

['http://www...']
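
For comparison, the regex route mentioned in the question could look roughly like the sketch below. It uses only the standard library; the names SCRIPT_RE, HREF_RE and hrefs_in_scripts are made up for this illustration, and the patterns assume quoted href values and no nested or unusual script markup, so treat it as a starting point rather than a robust parser.

```python
import re

code = '...<script>document.write("<a href="http://www...">hello</a>"); </script>...'

# Body of each <script> element (non-greedy, so scripts are matched one at a time).
SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
# Value of a single- or double-quoted href attribute.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def hrefs_in_scripts(html_source):
    """Return href values found inside <script> bodies that call document.write."""
    hrefs = []
    for script_body in SCRIPT_RE.findall(html_source):
        if "document.write" not in script_body:
            continue  # same filter as the XPath condition in the question
        hrefs.extend(match.group(2) for match in HREF_RE.finditer(script_body))
    return hrefs


print(hrefs_in_scripts(code))  # ['http://www...']
```

The lxml version above is usually the safer choice, because it lets an HTML parser deal with attribute quoting instead of a hand-written pattern.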

unutbu

Cool, looks like the solution I want. In the next step, I must extract within the document.write function. Problem: is not a valid tag that I extract like the tag. Any idea how I can do this with the same method? – user3507915 Dec 29 '14 at 13:18
Someone with more knowledge of JavaScript than I have may be able to show you how to handle this more robustly. From the Python side, all I can suggest is using regex or `str.replace` to convert `""` to ` – unutbu Dec 29 '14 at 14:16

Vivek Sable · Answer 2 · 2014-12-30T06:09:44.937

-1

Follow these steps to get the href value of the "a" tag inside the "script" tags:

Get the text of each "script" tag with the getiterator method (called iter in current lxml).
Parse that text again as HTML to create a new root, script_root.
Find the href attribute of each "a" tag under script_root, again with getiterator.

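from lxml import html

code = """<script>document.write("<a href="http://www...">hello</a>"); </script>"""
root = html.fromstring(code)
# iter() is the non-deprecated spelling of getiterator().
for script in root.iter("script"):
    if not script.text:
        continue  # skip empty <script> tags
    # Parse the script's text as HTML in its own right.
    script_root = html.fromstring(script.text)
    for a in script_root.iter("a"):
        href = a.get("href")  # None when the attribute is missing
        if href is not None:
            print("href:-", href)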
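
If lxml is not available, the same two-pass idea (collect the script text, then parse that text as HTML) can be sketched with the standard library's html.parser. The class names ScriptTextCollector and HrefCollector and the helper hrefs_from_scripts are hypothetical, chosen only for this example, and the sketch assumes the script text contains reasonably well-formed tags.

```python
from html.parser import HTMLParser


class ScriptTextCollector(HTMLParser):
    """First pass: collect the raw text inside every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._chunks = None  # None while we are outside a <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so accumulate it.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.scripts.append("".join(self._chunks))
            self._chunks = None


class HrefCollector(HTMLParser):
    """Second pass: collect href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def hrefs_from_scripts(html_source):
    first = ScriptTextCollector()
    first.feed(html_source)
    first.close()
    second = HrefCollector()
    for script_text in first.scripts:
        second.feed(script_text)
    second.close()
    return second.hrefs


code = """<script>document.write("<a href="http://www...">hello</a>"); </script>"""
print(hrefs_from_scripts(code))
```

Filtering on 'document.write', as the question's XPath does, could be added by checking each collected script text before the second pass.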

edited Dec 30 '14 at 06:09

answered Dec 29 '14 at 13:06

Vivek Sable

Posting code is not enough - please also explain what it does. Thanks! – Mathias Müller Dec 29 '14 at 16:26
@MathiasMüller: yes, explanation was missing, now added explanation. – Vivek Sable Dec 30 '14 at 06:13
