0

Hi, I get the above error. Why does it occur, what am I missing, and how do I get around it? Thanks.

try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

from html2text import html2text

sock = html2text(urllib2.urlopen('http://www.example.com')) 
htmlSource = sock.read()                            
sock.close()                                        
print (htmlSource)

I'm running IDLE 3.4.3 on Windows 7.

alecxe
user3115713

3 Answers

2

html2text expects the HTML to be passed in as a string, so read the response first:

source = urllib2.urlopen('http://www.example.com').read()
text = html2text(source)
print(text)

It prints:

# Example Domain

This domain is established to be used for illustrative examples in documents.
You may use this domain in examples without prior coordination or asking for
permission.

[More information...](http://www.iana.org/domains/example)
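
On Python 3, `urlopen(...).read()` returns `bytes` rather than `str`, so depending on the html2text version the snippet above may still need a decoding step. Here is a minimal sketch for Python 3 only, assuming UTF-8 as the fallback encoding. `decode_body` and `fetch_text` are hypothetical helper names, not part of html2text or urllib:

```python
from urllib.request import urlopen  # Python 3 only


def decode_body(raw, charset=None):
    """Turn a raw response body into text, assuming UTF-8 if no charset is given."""
    if isinstance(raw, bytes):
        return raw.decode(charset or "utf-8", errors="replace")
    return raw  # already text


def fetch_text(url):
    """Fetch a page and return its body as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()
        # The headers object can report the charset declared by the server,
        # or None when the server does not declare one.
        charset = response.headers.get_content_charset()
    return decode_body(raw, charset)


# Usage (needs network access and html2text installed):
#   text = html2text(fetch_text('http://www.example.com'))
```

Keeping the decoding in its own function means it can be checked without touching the network.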
alecxe
  • @user3115713 please go through this and the previous questions you've asked and see if there are answers that need to or should be accepted. Thanks! – alecxe Jun 20 '15 at 03:17
1

I think I found a solution for Python 3.4. I decoded the source from UTF-8 bytes into a string and it worked.

#!/usr/bin/python

try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

from html2text import html2text

# read() returns bytes on Python 3, so decode before calling html2text
source = urllib2.urlopen('http://www.example.com').read()
s = html2text(source.decode("UTF-8"))

print(s)

Output

# Example Domain

This domain is established to be used for illustrative examples in documents.
You may use this domain in examples without prior coordination or asking for
permission.

[More information...](http://www.iana.org/domains/example)
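
As for why the decode is needed: on Python 3, `read()` returns `bytes`, and the traceback quoted in the comments points at a `data.replace(...)` call with string arguments. Mixing the two types fails, which this offline sketch reproduces. The sample markup is made up, and the code only imitates that kind of call rather than html2text's actual internals:

```python
raw = b"<p>Hello</p>"  # stand-in for what urlopen(...).read() returns on Python 3

try:
    raw.replace("<p>", "")  # str arguments on a bytes object
    mixed_ok = True
except TypeError:
    mixed_ok = False  # Python 3 refuses to mix bytes and str

text = raw.decode("utf-8")  # bytes -> str
cleaned = text.replace("<p>", "").replace("</p>", "")
print(mixed_ok, cleaned)
```

On Python 3 this prints `False Hello`: the replace only succeeds once the bytes have been decoded.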
Alex Ivanov
  • Thanks, it works now. Python 3.x seems to live in a little world of its own. Why did the source have to be decoded as UTF-8, and how did you arrive at that approach? Thanks – user3115713 Jun 20 '15 at 07:41
0

replace is a string method, and what you have is a file object, not a string:

obj=urllib2.urlopen('http://www.example.com')
print obj

This prints:

<addinfourl at 3066852812L whose fp = <socket._fileobject object at 0xb6d267ec>>

This one is OK.

#!/usr/bin/python

try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

from html2text import html2text


# Works on Python 2, where read() already returns a str
source = urllib2.urlopen('http://www.example.com').read()
s = html2text(source)

print(s)

Output

This domain is established to be used for illustrative examples in documents.
You may use this domain in examples without prior coordination or asking for
permission.

[More information...](http://www.iana.org/domains/example)
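
The difference between the response object and its contents can be checked without a network connection. In this sketch `io.BytesIO` is only a stand-in for the object `urlopen` returns, an assumption made purely for illustration:

```python
import io

fake_response = io.BytesIO(b"<h1>Example Domain</h1>")  # stand-in for urlopen(...)

has_replace_before = hasattr(fake_response, "replace")  # file-like object: no replace()
body = fake_response.read()                             # the actual contents
has_replace_after = hasattr(body, "replace")            # the contents do have replace()
fake_response.close()

print(has_replace_before, has_replace_after)
```

Only the value returned by `read()` supports `replace`, which is why the file-like object has to be read before it is handed to html2text.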
Alex Ivanov
  • still getting a data = data.replace("' + 'script>", "") "TypeError: 'str' does not support the buffer interface" – user3115713 Jun 20 '15 at 03:43
  • That's weird. I don't. – Alex Ivanov Jun 20 '15 at 03:52
  • What version of Python are you using? I'm running 3.x with IDLE on Windows 7. – user3115713 Jun 20 '15 at 04:52
  • Python 2.7 on Ubuntu. And yes, I tried this script with Python 3.4, it gave an error "data = data.replace("' + 'script>", "") TypeError: expected bytes, bytearray or buffer compatible object". I think html2text module is not compatible with Python 3.4. "print (source)" works OK, it just fails to produce any text out of html. Weird. – Alex Ivanov Jun 20 '15 at 05:16