2

With this Python code I can get the whole HTML source:

import mechanize
import lxml.html

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]

# Open the login URL.
sign_in = br.open("http://target.co.uk")

# Select the form by its index. There is only one form here, so nr=0.
# Alternatively, select the form by its name attribute if it has one.
br.select_form(nr=0)

# "username" and "password" are the names of the form's input fields.
br["username"] = "myusername"
br["password"] = "myp4sw0rd"

# Submit the credentials and read the body of the page we are redirected to.
logged_in = br.submit()
logincheck = logged_in.read()
if "logout" in logincheck:
    print("Login success, you just logged in.")
else:
    print("Login failed")

# Other URLs can be opened the same way once logged in.
coding1_content = br.open("https://www.target.co.uk/levels/coding/1").read()

# Parse the page source directly from the string returned by read().
tree = lxml.html.fromstring(coding1_content)

for ta in tree.findall(".//textarea"):
    if not ta.get("name"):
        print(ta.text)

if "textarea" in coding1_content:
    print("Textarea found.")
else:
    print("Textarea not found.")

But what I need is the content of the first textarea tag, the one that doesn't have a name attribute. My HTML source looks like this:

........
........
<textarea>this, is, what, i, want</textarea>
<textarea name="answer">i don't need it</textarea>
........
........

Any help will be appreciated.

Dark Cyber

3 Answers

1

According to the lxml documentation, you can access the forms of a parsed HTML document through its forms property:

from lxml.html import fromstring

form_page = fromstring('''some html code with a <form>''')
form = form_page.forms[0]  # the first form on the page
form.fields  # the form's fields, looked up by their name attribute

See the "Forms" section of http://lxml.de/lxmlhtml.html for more.
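As a rough sketch of how this plays out on the question's markup: `form.fields` is keyed by field name, so a textarea without a name attribute cannot be reached through it, but walking the form's elements still finds it. The `sample_html` string below is made up for illustration and assumes, which the question does not confirm, that a `<form>` wraps the textareas.

```python
import lxml.html

# Hypothetical stand-in for the real page; assumes a <form> wraps the textareas.
sample_html = """
<html><body>
  <form>
    <textarea>this, is, what, i, want</textarea>
    <textarea name="answer">i don't need it</textarea>
  </form>
</body></html>
"""

form_page = lxml.html.fromstring(sample_html)
form = form_page.forms[0]

# form.fields maps field names to values, so only the named textarea shows up.
named_fields = list(form.fields.keys())
answer_value = form.fields["answer"]

# The unnamed textarea has to be found by iterating over the form's elements.
unnamed = [ta.text for ta in form.iter("textarea") if ta.get("name") is None]

print(named_fields)
print(answer_value)
print(unnamed[0])
```

Because lookups go by name, indexing `form.fields` with an integer such as `0` is treated as a field name and fails with a KeyError.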

faebser
  • Actually, the HTML source comes from coding1_content = br.open("https://www.target.co.uk/protected/page/1").read(), not from a file, so I need to open the URL, get the source, and then scrape the textarea content. – Dark Cyber Apr 08 '15 at 10:43
  • If you manage to read the HTML page into a string and get it parsed by lxml, you can access the form. – faebser Apr 08 '15 at 10:45
  • You could also use Beautiful Soup (https://en.wikipedia.org/wiki/Beautiful_Soup), a dedicated HTML parsing library that can use lxml as its underlying parser. – faebser Apr 08 '15 at 10:45
  • I just tried `form_page = lxml.html.fromstring(coding1_content)`, `form = form_page.forms[0]`, `print form.fields[0].value` and I get KeyError: 'No input element with the name 0'. The first textarea doesn't have any attributes, so where is my mistake? – Dark Cyber Apr 08 '15 at 10:55
  • Maybe there is no form surrounding it. You could also try http://lxml.de/cssselect.html to use a CSS selector, or the XPath that the other answers mention. – faebser Apr 08 '15 at 12:10
0

If the HTML is

<html>
  <body>
    <form>
      <textarea>this, is, what, i, want</textarea>
      <textarea name="answer">i don't need it</textarea>
    </form>
  </body>
</html>

you can get the textarea content like this:

import io
import lxml.html

html = "..."
tree = lxml.html.parse(io.StringIO(html))
for ta in tree.findall(".//textarea"):
    if not ta.get("name"):
        print(ta.text)

Output:

this, is, what, i, want
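When the HTML comes from a network response instead of a literal string, it usually arrives as bytes. The following is only a sketch under that assumption: `page_bytes` is a made-up stand-in for the result of a `.read()` call, and `first_unnamed_textarea` is a hypothetical helper, not part of lxml. It shows two ways to parse bytes: `lxml.html.fromstring`, which accepts them directly, and `lxml.html.parse` with an `io.BytesIO` wrapper.

```python
import io
import lxml.html

# Hypothetical stand-in for the bytes returned by a response's read() call.
page_bytes = b"""
<html><body>
  <textarea>this, is, what, i, want</textarea>
  <textarea name="answer">i don't need it</textarea>
</body></html>
"""


def first_unnamed_textarea(node):
    """Return the text of the first <textarea> without a name attribute, or None."""
    for ta in node.iter("textarea"):
        if ta.get("name") is None:
            return ta.text
    return None


# Option 1: fromstring() takes the bytes directly and returns the root element.
root = lxml.html.fromstring(page_bytes)
from_root = first_unnamed_textarea(root)

# Option 2: parse() expects a filename or file-like object, so wrap the bytes.
tree = lxml.html.parse(io.BytesIO(page_bytes))
from_tree = first_unnamed_textarea(tree)

print(from_root)
print(from_tree)
```

Returning on the first match, rather than printing every match, fits the stated goal of getting only the first unnamed textarea.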
  • Actually, the HTML source comes from coding1_content = br.open("https://www.target.co.uk/protected/page/1").read(), not from a file, so I need to open the URL, get the source, and then scrape the textarea content. – Dark Cyber Apr 08 '15 at 10:43
  • You can use `io.StringIO` to parse from a string. –  Apr 08 '15 at 11:05
  • I tried your code but I get the error shown in this picture: http://prntscr.com/6r6jlb. What's wrong? – Dark Cyber Apr 08 '15 at 14:58
  • Please [edit your question](http://stackoverflow.com/posts/29512047/edit) and include the complete code of `alpha2.py`. –  Apr 08 '15 at 15:09
  • Your indentation is wrong. The line `print(ta.text)` must be indented *two* levels. –  Apr 08 '15 at 19:01
  • I indented it like you said but I still get the error shown in my picture. It looks like a weird error. May I chat with you? – Dark Cyber Apr 08 '15 at 20:16
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/74784/discussion-between-dark-cyber-and-tichodroma). – Dark Cyber Apr 09 '15 at 05:10
0

Another way to get all <textarea> elements that have no name attribute is to use the xpath() method:

.....
for t in tree.xpath(".//textarea[not(@name)]"):
    print t.text

While findall() supports only a subset of the XPath language, xpath() has full XPath 1.0 support. For example, as this case demonstrates, xpath() supports not() and findall() doesn't.
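To make the difference concrete, here is a small self-contained sketch. The `sample_html` string is an assumed stand-in for the real page. It runs the same predicate through both methods and, since findall() rejects it, falls back to filtering in Python.

```python
import lxml.html

# Hypothetical stand-in for the real page source.
sample_html = """
<html><body>
  <textarea>this, is, what, i, want</textarea>
  <textarea name="answer">i don't need it</textarea>
</body></html>
"""

tree = lxml.html.fromstring(sample_html)

# xpath() understands the not() function.
via_xpath = [t.text for t in tree.xpath(".//textarea[not(@name)]")]

# findall() only implements a subset of XPath and rejects the same expression.
try:
    tree.findall(".//textarea[not(@name)]")
    findall_accepts_not = True
except SyntaxError:
    findall_accepts_not = False

# With findall(), select all textareas and filter on the attribute in Python.
via_findall = [t.text for t in tree.findall(".//textarea") if t.get("name") is None]

print(via_xpath)
print(findall_accepts_not)
print(via_findall)
```

Both approaches return the same elements here; xpath() just moves the filtering into the expression.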

har07