
Earlier, I asked this question: How to convert some character into five digit unicode one in Python 3.3?

Today I found that the capital-U code point escape (\U followed by eight hex digits) works when I print the string, but fails when I write the string to a file. Why?

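import codecs  # needed for codecs.open(); missing from the original snippet
import re

f = codecs.open('test.txt', 'r', encoding="utf-8")
g = codecs.open('test_output.txt', 'w', encoding="utf-8")
fin = f.read()
# Replace every 'm' with U+243D0, a code point outside the BMP
output = re.sub('m', '\U000243D0', fin)
g.write(output)
# Close both files so the output is flushed to disk
f.close()
g.close()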
user1610952
  • Fail *how* exactly? There is nothing wrong with your code here, what is the output you get versus the output you expected? – Martijn Pieters Feb 05 '13 at 13:30
  • @dan04: the `codecs` usage points to Python 2; in Python 3 you'd just use `open()` instead, *normally*. – Martijn Pieters Feb 05 '13 at 13:35
  • I'm using Python 3.3. Strangely, m is replaced by ए. Its codepoint is \u090F. – user1610952 Feb 05 '13 at 14:06
  • @user1610952: What are you using to test the data? `\u090F` is encoded to UTF-8 as `\xE0\xA4\x8F` (three bytes starting with `\xE0`), and `\U000243D0` is encoded as `\xF0\xA4\x8F\x90`; there is an overlap there if you drop *1* bit from the first byte and ignore the `\x90` byte. *Python does not do this (I tested it)*, so what tool are you using that corrupts the data or misinterprets it? – Martijn Pieters Feb 05 '13 at 14:53
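
The byte overlap described in the last comment can be checked directly. This is a minimal sketch, not part of the original thread; the variable names are illustrative only.

```python
# Compare the UTF-8 encodings of the two characters discussed above.
expected = '\U000243D0'   # the replacement character used in the question
displayed = '\u090F'      # the character the asker saw instead

expected_bytes = expected.encode('utf-8')
displayed_bytes = displayed.encode('utf-8')

print(expected_bytes)   # b'\xf0\xa4\x8f\x90'
print(displayed_bytes)  # b'\xe0\xa4\x8f'

# XOR of the two lead bytes: a single set bit means they differ in one bit.
lead_bit_difference = expected_bytes[0] ^ displayed_bytes[0]
# The two bytes after the lead byte are identical in both encodings.
shared_continuation = expected_bytes[1:3] == displayed_bytes[1:3]

print(hex(lead_bit_difference), shared_continuation)
```

A tool that mishandles four-byte UTF-8 sequences could plausibly turn one character into the other, which is consistent with the symptom reported here, though the thread does not establish exactly what the editor does internally.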

1 Answer


This works just fine for me:

import re

with open('/tmp/test.txt', 'w', encoding='utf8') as testfile:
    testfile.write("I don't go to school on mondays")

with open('/tmp/test.txt', 'r', encoding='utf8') as testfile, open('/tmp/test_output.txt', 'w', encoding='utf8') as testout:
    output = re.sub('m', '\U000243D0', testfile.read())
    testout.write(output)

with open('/tmp/test_output.txt', 'r', encoding='utf8') as testfile:
    print(repr(testfile.read()))

outputs

"I don't go to school on 𤏐ondays"
Martijn Pieters
  • Note that in 2.6 and 2.7, you can use the statement `from __future__ import unicode_literals` to make strings have type `unicode` by default. – dan04 Feb 05 '13 at 13:43
  • @dan04: Sure, that's great for people writing code that needs to run on python 2 *and* python 3, but for most developers targeting just *one* version, that's not as helpful, usually. :-) – Martijn Pieters Feb 05 '13 at 13:44
  • Thanks, but I'm using Python 3.3. Still I don't understand why it doesn't work. – user1610952 Feb 05 '13 at 14:02
  • @user1610952: Then please do show (an excerpt of) the output that was written to the file, and what you expected it to be. You could use python to read it back and show us a `repr()` of the bytes you wanted to correct. – Martijn Pieters Feb 05 '13 at 14:03
  • test.txt contains: I don't go to school on mondays. It was changed into this (test_out.txt): I don't go to school on एondays. What I expected is: I don't go to school on 𤏐ondays. – user1610952 Feb 05 '13 at 14:21
  • You really want to update your question. What tool did you use to display that string? If in python, can you give us the `repr()` of it? – Martijn Pieters Feb 05 '13 at 14:22
  • When I add "print(output)", the result is: I don't go to school on 𤏐ondays – user1610952 Feb 05 '13 at 14:52
  • When I add "print(repr(output))", the result is: "\ufeffI don't go to school on 𤏐ondays." But the written file g (test_output.txt) still has a strange character: I don't go to school on एondays. – user1610952 Feb 05 '13 at 14:54
  • @user1610952: What tool is displaying `test_output.txt`? That output you gave me is using a BOM, so are you on Windows and are you using Notepad perhaps? – Martijn Pieters Feb 05 '13 at 14:57
  • Oh, I think the editor is the problem. I viewed the result in EditPad Pro, but I don't think it renders the character properly. In BabelPad, it works. – user1610952 Feb 05 '13 at 14:57
  • @user1610952: That's why I insisted on knowing what tool you are using; Python is working fine (demonstrated above), it's your *other* tools that are broken. :-) – Martijn Pieters Feb 05 '13 at 15:03
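
One way to check a file without relying on any editor is to read it back in binary mode and look at the raw bytes. The sketch below is not from the thread. It assumes a scratch file in a temporary directory (the path is hypothetical) and simulates a file that starts with a UTF-8 BOM, as the `\ufeff` in the `repr()` output above suggests. It also shows that the `utf-8-sig` codec drops the BOM on reading, while plain `utf-8` keeps it as `\ufeff`.

```python
import codecs
import os
import tempfile

expected = "I don't go to school on \U000243D0ondays."

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical scratch file, used only for this illustration.
    path = os.path.join(tmpdir, 'test_output.txt')

    # Simulate a file that begins with a UTF-8 byte order mark (BOM).
    with open(path, 'wb') as out:
        out.write(codecs.BOM_UTF8 + expected.encode('utf-8'))

    # Raw bytes: independent of any editor, font, or terminal.
    with open(path, 'rb') as raw:
        data = raw.read()

    # Plain utf-8 keeps the BOM as a leading '\ufeff' character.
    with open(path, 'r', encoding='utf-8') as f:
        with_bom = f.read()

    # utf-8-sig strips a leading BOM if one is present.
    with open(path, 'r', encoding='utf-8-sig') as f:
        without_bom = f.read()

print(data)                # raw bytes, BOM first
print(ascii(with_bom))     # escaped form, starts with \ufeff
print(ascii(without_bom))  # escaped form, no BOM
```

If the bytes `\xf0\xa4\x8f\x90` appear in the raw data, the file holds U+243D0 correctly and any wrong glyph comes from the viewer, not from Python.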