Reading Arabic text encoded in UTF-8 in Python

Question

I am using Python 2.7. I have got the following line (string) from a text file encoded in utf-8:

"تازہ ترین خبروں، بریکنگ نیوز، ویڈیو، آڈیو، فیچر اور تجزیوں کے لیے بی بی سی اردو"

I am using the following code to print it on the screen:

import codecs
filename = codecs.open('file path', 'r', encoding="utf-8")
outputfile = filename.readlines()
print outputfile

It gives the following output:

[u'\ufeff\u062a\u0627\u0632\u06c1 \u062a\u0631\u06cc\u0646 \u062e\u0628\u0631\u0648\u06ba\u060c \u0628\u0631\u06cc\u06a9\u0646\u06af \u0646\u06cc\u0648\u0632\u060c \u0648\u06cc\u0688\u06cc\u0648\u060c \u0622\u0688\u06cc\u0648\u060c \u0641\u06cc\u0686\u0631 \u0627\u0648\u0631 \u062a\u062c\u0632\u06cc\u0648\u06ba \u06a9\u06d2 \u0644\u06cc\u06d2 \u0628\u06cc \u0628\u06cc \u0633\u06cc \u0627\u0631\u062f\u0648 \u06a9\u06cc \u0648\u06cc\u0628']

The goal is to print the text correctly, not merely to print each line. How can I print the string, or the content of the text file, in its original form? Like this:

تازہ ترین خبروں، بریکنگ نیوز، ویڈیو، آڈیو، فیچر اور تجزیوں کے لیے بی بی سی اردو

score 4 · Accepted Answer · answered Dec 12 '13 at 02:47

What you see is the repr() of the string. Because you are printing the list, Python shows the repr() of each element, with non-ASCII characters escaped, rather than the readable form.

Print each line individually instead:

for line in outputfile:
    print(line)

Demo:

>>> s = u'\ufeff\u062a\u0627\u0632\u06c1 \u062a\u0631\u06cc\u0646 \u062e\u0628\u0631\u0648\u06ba\u060c \u0628\u0631\u06cc\u06a9\u0646\u06af \u0646\u06cc\u0648\u0632\u060c \u0648\u06cc\u0688\u06cc\u0648\u060c \u0622\u0688\u06cc\u0648\u060c \u0641\u06cc\u0686\u0631 \u0627\u0648\u0631 \u062a\u062c\u0632\u06cc\u0648\u06ba \u06a9\u06d2 \u0644\u06cc\u06d2 \u0628\u06cc \u0628\u06cc \u0633\u06cc \u0627\u0631\u062f\u0648 \u06a9\u06cc \u0648\u06cc\u0628'

>>> print(s)
تازہ ترین خبروں، بریکنگ نیوز، ویڈیو، آڈیو، فیچر اور تجزیوں کے لیے بی بی سی اردو کی ویب
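
The relationship between the list output and the individual strings can be checked directly. The following is a minimal sketch, not the answerer's code, written so that it runs on both Python 2.7 and Python 3. The `lines` value is a shortened, hypothetical stand-in for what `readlines()` returned in the question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Hypothetical stand-in for the readlines() result: the first two words only.
lines = ['\u062a\u0627\u0632\u06c1 \u062a\u0631\u06cc\u0646\n']

# Printing a list shows repr() of the list, which is assembled from the
# repr() of every item it contains.
list_form = repr(lines)
item_form = repr(lines[0])
assert list_form == '[' + item_form + ']'

# Joining the items gives one plain text string; printing that string sends
# the characters themselves to the console instead of their repr().
text = ''.join(lines)
# print(text)  # needs a console whose encoding covers these characters
```

On Python 2.7 `item_form` contains the `\uXXXX` escapes seen in the question; on Python 3 `repr()` keeps printable non-ASCII characters, and `ascii()` produces the escaped form.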

aIKid

Are you using Python 3? – Coddy Dec 12 '13 at 02:51
Actually, in this example, no. It's Python 2.7. I'm just getting too accustomed to Python 3. – aIKid Dec 12 '13 at 03:00
It doesn't work on my system (Win7). I am wondering how you got the demo? – Coddy Dec 12 '13 at 03:05
Oh? What happened with yours? – aIKid Dec 12 '13 at 03:07
No. Windows. The operating system doesn't matter here. What kind of output did you get? – aIKid Dec 12 '13 at 03:08
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode return codecs.charmap_encode(input,errors,encoding_table) UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined> – Coddy Dec 12 '13 at 03:10
It's likely your terminal program ... try running it in IDLE (the font in cmd.exe does not support Arabic characters) – Joran Beasley Dec 12 '13 at 03:12
I am using Spyder. I tried it in IDLE and it works. Good. But why does Spyder have this bug? Any suggestions for a fix? – Coddy Dec 12 '13 at 03:16
@Coddy: this may answer your question: http://stackoverflow.com/a/5708560/13005. The problem isn't the font as such: if it were, you'd see replacement characters, not an encoding error. If that answer is right that Spyder uses CP-1252, then Arabic characters simply cannot be represented in that character encoding, so Spyder will never accept them. Best case, Spyder has an option somewhere to change the encoding used for the terminal. Change it to one that has the characters you need (UTF-8 if possible; failing that, try ISO 8859-6). Then encode the string to that encoding before you print. – Steve Jessop Dec 12 '13 at 03:37
... *then* you'll find out whether Spyder has a font containing Arabic characters. Or, if it's just using cmd.exe as its terminal, you'll find out whether Windows can find such a font. Last time I thought about changing the cmd.exe encoding, one of my colleagues warned me that at least some of the advertised ways would crash it on Windows 7, so I didn't risk it. Other terminals are available for Windows, but whether Spyder will work with them is a separate issue; I don't know Spyder at all. It's all a bit fiddly. – Steve Jessop Dec 12 '13 at 03:46
@Coddy As others have said, the problem is with your terminal. But for encoded text, you usually don't need to print it to the terminal. If you really need to see the output, just write it to a file or somewhere else. – aIKid Dec 12 '13 at 04:13
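
The encoding problem discussed in these comments can be reproduced without any particular terminal. The sketch below is an illustration under stated assumptions, not code from the thread: `can_encode`, `write_utf8`, and `print_or_save` are hypothetical helper names, and the sample string is the first two words of the line in the question. It runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import io
import sys


def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def write_utf8(text, path):
    """Write `text` to `path` as UTF-8 instead of printing it."""
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)


def print_or_save(text, fallback_path):
    """Print `text` if the console encoding can represent it, else save it."""
    # sys.stdout.encoding can be None, for example when output is piped
    # on Python 2; fall back to ASCII in that case (an assumption).
    console_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    if can_encode(text, console_encoding):
        print(text)
    else:
        write_utf8(text, fallback_path)


# First two words of the line from the question.
sample = '\u062a\u0627\u0632\u06c1 \u062a\u0631\u06cc\u0646'

cp1252_ok = can_encode(sample, 'cp1252')          # the codec in the traceback
utf8_ok = can_encode(sample, 'utf-8')
iso_8859_6_ok = can_encode(sample, 'iso-8859-6')
```

A call such as `print_or_save(sample, 'output.txt')`, with a hypothetical output path, would print on a UTF-8 console and write a file otherwise. One thing the check brings out: the sample text is Urdu, and letters such as U+06C1 fall outside ISO 8859-6, so for this particular text UTF-8 looks like the only one of the three encodings that works.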

score 1 · Answer 2 · answered Dec 12 '13 at 02:48

readlines() returns a list. When you print a list, it prints the repr() of each item in the list. The repr() of a unicode string escapes non-ASCII characters, as you see here, so that it does not depend on the system encoding. You want to print the string directly:

print outputfile[0]
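
The list in the question starts with `\ufeff`, a byte order mark (BOM), which `outputfile[0]` would still carry. The following sketch assumes the file begins with a UTF-8 BOM and reads it with the `utf-8-sig` codec, which skips a leading BOM when decoding. `read_text` is a hypothetical helper, and the temporary file only stands in for the real one. It runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import io
import os
import tempfile


def read_text(path):
    """Return the file content as one string, without a leading BOM."""
    # 'utf-8-sig' decodes UTF-8 and drops a byte order mark if one is present.
    with io.open(path, 'r', encoding='utf-8-sig') as handle:
        return handle.read()


# Build a small sample file that starts with a BOM, like the one in the question.
sample = '\u062a\u0627\u0632\u06c1 \u062a\u0631\u06cc\u0646\n'
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(sample_path, 'w', encoding='utf-8') as handle:
    handle.write('\ufeff' + sample)

content = read_text(sample_path)
os.remove(sample_path)

# print(content)  # needs a console whose encoding covers these characters
```

Reading the whole file with `read()` also avoids the list entirely, so the result can be passed to `print` as a single string.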
