Special national characters won't .split() in Python

Question

I am having trouble reading special national characters from a text file in Python.

with open("../Data/DKsnak.txt") as f:
    content = f.readlines()

str1 = content[0]
print "string:",str1

lst1 = str1.split()
print "list:",lst1

The output is as follows:

string: Udtræk fra observatør på årstal
list: ['Udtr\xc3\xa6k', 'fra', 'observat\xc3\xb8r', 'p\xc3\xa5', '\xc3\xa5rstal']

The first line is as expected, including the special Danish characters. But they don't survive being split into a list. I have tried various tricks with codecs and unicode, but can't find the magic bullet.

Can anyone suggest how to get these words into a list, so I can work with them as such?

Best regards Martin

Running: Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2

You don't have unicode, you have a *byte string*. Encoded bytes are not individual characters. — Martijn Pieters, Apr 27 '14 at 09:37
You are confusing string *representation* with string *values*; Python is giving you representations you can use to recreate the original values. — Martijn Pieters, Apr 27 '14 at 09:42

score 2 · Accepted Answer · answered Apr 27 '14 at 09:36

Your code is fine. When Python prints a list, it shows the repr() of each element, so the non-ASCII bytes appear as \x escapes. The strings themselves are unchanged: if you print each one on its own, you still get the original text:

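# -*- coding: utf-8 -*-
# The line above is required in Python 2 when the source file contains non-ASCII literals.
s = 'Udtræk fra observatør på årstal'
words = s.split()

for word in words:
    print word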

[OUTPUT]         #all fine
Udtræk
fra
observatør
på
årstal
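The difference between a string's representation and its value can be sketched as follows. This is only an illustration, not the asker's code: it assumes the file is UTF-8 encoded, hard-codes part of the sample line as bytes, and uses `from __future__ import print_function` so the same snippet runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# UTF-8 bytes for the first three words of the sample line,
# as a Python 2 str read from the file would hold them.
raw = b"Udtr\xc3\xa6k fra observat\xc3\xb8r"
words = raw.split()

# Printing the list shows the repr() of each element,
# so the non-ASCII bytes appear as \x escapes.
print(words)

# Decoding an element gives the text itself; the escapes were only a display form.
for w in words:
    print(w.decode("utf-8"))

# Byte count and character count differ because 'ae' ligature takes two bytes in UTF-8.
print(len(words[0]), len(words[0].decode("utf-8")))  # 7 6
```

The split itself loses nothing; only the way the list is displayed differs from the way a single string is displayed.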

score 2 · Answer 2 · Beta Decay · 2014-04-27T09:54:47.923

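Building on the for loop from the previous answer, if you want the words on the same line:

string = ''
for i in range(len(list1)):
    string += list1[i] + ' '

# A shorter equivalent, without the trailing space: string = ' '.join(list1)
print(string)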

edited Apr 27 '14 at 09:54

answered Apr 27 '14 at 09:46


score 1 · Answer 3 · answered Apr 27 '14 at 09:39

From the Python 2.7 Unicode HOWTO (https://docs.python.org/2.7/howto/unicode.html):

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')

That way you read unicode strings, which you can then split.
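Applied to the question's situation, decoding while reading might look like the sketch below. It rests on several assumptions: the file is UTF-8 encoded, a temporary file stands in for `../Data/DKsnak.txt`, and `io.open` is swapped in for `codecs.open` because it accepts the same `encoding` argument and behaves the same way on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import tempfile

# Hypothetical stand-in for ../Data/DKsnak.txt: the sample line, written as UTF-8.
# The escapes spell "Udtraek fra observatoer paa aarstal" with the Danish letters.
sample = u"Udtr\u00e6k fra observat\u00f8r p\u00e5 \u00e5rstal\n"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Decode while reading: each line comes back as text, not as raw bytes.
with io.open(path, encoding="utf-8") as f:
    content = f.readlines()
os.remove(path)

lst1 = content[0].split()

print(len(lst1))     # 5
print(len(lst1[0]))  # 6 characters, although the word occupies 7 bytes in the file
```

On Python 2 the elements of `lst1` are `unicode` objects, so printing the list shows `u'...'` escapes rather than `\xc3\xa6` byte pairs; printing an individual word still shows the Danish letters.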

user3535644
