Does anyone have any experience with this?

I have been using python 3.2 for the last half a year, and my memory of 2.6.2 is not that great.

On my computer the following code works, tested using 2.6.1:

import contextlib
import codecs

def readfile(path):
    with contextlib.closing( codecs.open( path, 'r', 'utf-8' )) as f:
        for line in f:
            yield line

path = '/path/to/norsk/verbs.txt'

for i in readfile(path):
    print i

but on the phone it gets to the first special character ø and throws:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 3: ordinal not in range(128)

Any ideas? I am going to need to input these characters as well as read them from a file.

beoliver

2 Answers

Printing is an I/O operation, and I/O requires bytes. What you have in i is a unicode object, that is, characters. Characters only convert to bytes implicitly when the ascii codec can handle them, but on your phone you have hit a non-ASCII character (u'\xf8' is ø). To convert such characters to bytes, you need to encode them explicitly:

import contextlib
import codecs

def readfile(path):
    with contextlib.closing( codecs.open( path, 'r', 'utf-8' )) as f:
        for line in f:
            yield line

path = '/path/to/norsk/verbs.txt'

for i in readfile(path):
    print i.encode('utf8')
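
To make the effect of encode concrete, here is a minimal sketch. The word løpe is only a hypothetical sample (it is not taken from verbs.txt), and the snippet is written so that it should run unchanged on Python 2.6+ and Python 3.3+:

```python
# Hypothetical sample word; u'\xf8' is the character ø.
word = u'l\xf8pe'

# Encoding to UTF-8 always succeeds: ø becomes the two bytes 0xC3 0xB8.
utf8_bytes = word.encode('utf-8')

# Encoding to ASCII fails with the same kind of error the phone reported.
try:
    word.encode('ascii')
    ascii_failed_at = None
except UnicodeEncodeError as exc:
    ascii_failed_at = exc.start  # index of the first unencodable character
```

Four characters become five bytes, and ascii_failed_at points at the ø, in the same way that "position 3" in the traceback pointed at it in the original line.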

As to why your code works on one machine and not the other, I bet Python's autodetection has found different things in those cases. Run this on each device:

$ python
>>> import sys
>>> sys.stdout.encoding
'UTF-8'

I expect you'll see UTF-8 on one and an ASCII encoding (or None) on the other. sys.stdout.encoding is what print uses when the destination is a terminal; when no encoding was detected, Python 2 falls back to the default encoding, which is ascii. If you're sure that all users of your Python installation (very possibly just you) prefer utf8 over ascii, you can change the default encoding of your Python installation.
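
If it helps to compare the two devices side by side, the following sketch collects the encodings Python reports in one place. The function name describe_output_encoding is made up for this example; the calls it uses are standard-library ones that should exist on both Python 2.6+ and Python 3:

```python
import locale
import sys

def describe_output_encoding(stream=sys.stdout):
    """Collect the encodings relevant to printing unicode text to `stream`."""
    return {
        # Encoding attached to the stream; None when nothing was detected.
        'stream_encoding': getattr(stream, 'encoding', None),
        # True only when the stream is an interactive terminal.
        'is_terminal': bool(hasattr(stream, 'isatty') and stream.isatty()),
        # Codec used for implicit unicode/bytes conversions.
        'default_encoding': sys.getdefaultencoding(),
        # Codec used for file names, not for print.
        'filesystem_encoding': sys.getfilesystemencoding(),
        # Encoding suggested by the user's locale settings.
        'locale_encoding': locale.getpreferredencoding(),
    }

for name, value in sorted(describe_output_encoding().items()):
    print('%s: %s' % (name, value))
```

Running it on the computer and on the phone and diffing the output should show which setting differs.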

  1. Find your site.py: python -c 'import site; print site'
  2. Open it and find the setencoding function:

    def setencoding(): 
        """Set the string encoding used by the Unicode implementation.  The 
        default is 'ascii', but if you're willing to experiment, you can 
        change this.""" 
        encoding = "ascii" # Default value set by _PyUnicode_Init() 
    
  3. Change the encoding = "ascii" line to encoding = "UTF-8"

Enjoy as things Just Work. You can find more information on this topic here: http://blog.ianbicking.org/illusive-setdefaultencoding.html

If you'd instead like a strict separation of bytes vs characters such as python3 provides, you can set encoding = "undefined". The undefined codec will "Raise an exception for all conversions. Can be used as the system encoding if no automatic coercion between byte and Unicode strings is desired."
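
As a rough illustration of how strict the undefined codec is, this sketch encodes two sample strings with it directly, without touching site.py. The helper name undefined_codec_rejects and the sample strings are hypothetical:

```python
def undefined_codec_rejects(text):
    """Return True if encoding `text` with the 'undefined' codec raises."""
    try:
        text.encode('undefined')
    except UnicodeError:
        return True
    return False

# The codec refuses every conversion, not only non-ASCII ones.
rejects_non_ascii = undefined_codec_rejects(u'l\xf8pe')
rejects_plain_ascii = undefined_codec_rejects(u'abc')
```

Both results are True, which is the point: with this codec as the default, any place that relies on implicit conversion fails immediately instead of only when a character like ø shows up.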

bukzor
  • Your codecs.open has taken care of decoding for you. If you were to use the simple `open()` you'd get bytes and wouldn't have to worry about this stuff so much, but if you're going to do any kind of processing (comparing content with another file, for example) you'll want to get characters, using `codecs.open` like you have. – bukzor Jul 13 '12 at 15:20
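
A small sketch of the difference described in this comment, assuming a throwaway temporary file stands in for verbs.txt (the file content is a hypothetical sample). It reads the same file once as raw bytes and once as decoded characters:

```python
import codecs
import os
import tempfile

# Create a temporary UTF-8 file with one sample line.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(u'l\xf8pe\n')

# Binary mode returns the bytes exactly as stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

# codecs.open decodes those bytes into characters while reading.
with codecs.open(path, 'r', 'utf-8') as f:
    text = f.read()

os.remove(path)
```

Here raw holds six bytes and text holds five characters; only text can be compared character by character with other decoded data.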

The print statement needs to convert the string to a printable form, since a unicode string is not automatically printable. Wrapping it with repr, as in print repr(i), will let you print without an error, but it shows the escaped representation rather than the character itself, so you probably want to specify the encoding instead.

ChipJust