I have a bunch of txt files that are encoded in shift_jis. I want to convert them to UTF-8 so that the special characters display properly. This has probably been asked before, but I can't seem to get it right.

Update: I changed my code so it first write to a list then it will write the content from the list.

words = []
with codecs.open("dummy.txt", mode='r+', encoding='shiftjis') as file:
    words = file.read()
    file.seek(0)
    for line in words:
        file.write(line.encode('utf-8'))

However, now I get a runtime error and the program just crashes. Upon further investigation, it seems that `file.seek(0)` is causing the crash: the program runs without error if this line is commented out. I don't know why. How is it causing errors?

tripleee
tonywang
  • Should this be tagged Python? – deceze Aug 05 '14 at 08:59
  • Thanks, I had a feeling that I was missing something – tonywang Aug 05 '14 at 09:24
  • _"Update: I changed my code so it first write to a list then it will write the content from the list."_ I don't see a list in your code snippet? – Burhan Khalid Aug 05 '14 at 11:59
  • @tonywang: Actually, your code does not write to a list at all. Your code initializes `words` with a list, then replaces it outright with the entire contents of the file as a string (which is the output of `file.read()`). – Mike DeSimone Aug 05 '14 at 12:07
  • I don't think you can expect to write UTF-8 to a file opened with `encoding='shiftjis'`; even if it worked, it would just convert your UTF-8 back to Shift-JIS. – Mike DeSimone Aug 05 '14 at 12:11
  • Is that so. I must have got confused. Does it affect the code? – tonywang Aug 05 '14 at 12:14
  • From the [docs for file.seek](https://docs.python.org/2/library/stdtypes.html#file.seek): "If the file is opened in text mode (without `'b'`), only offsets returned by `tell()` are legal. Use of other offsets causes undefined behavior. Note that not all file objects are seekable." – Mike DeSimone Aug 05 '14 at 12:15
  • Oh no...So is there no way around this? – tonywang Aug 05 '14 at 12:15
  • Aside from doing one `open` for reading, followed by another for writing, no. – Mike DeSimone Aug 05 '14 at 12:16

1 Answer

You can't read from and write to the same file at the same time like this, which is why it's not working. Input and output are buffered, and reads and writes share the same file pointer, so it's hard to predict what would happen. You either need to write the output to a different file, or read the entire file into memory, close it, reopen it and write it back out.

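import codecs

# Read the whole file, decoding it from Shift-JIS into a Unicode string.
with codecs.open("dummy.txt", mode='r', encoding='shiftjis') as file:
    text = file.read()

# Reopen the same file for writing; the codec encodes the text as UTF-8.
with codecs.open("dummy.txt", mode='w', encoding='utf-8') as file:
    file.write(text)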
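
For the other option mentioned above, writing the output to a different file, a minimal sketch for a whole directory of files might look like the following. The helper names `convert_file` and `convert_directory` are made up for illustration, and the sketch assumes every `.txt` file in the source directory really is Shift-JIS. It uses `io.open`, which exists in both Python 2.6+ and Python 3.

```python
import glob
import io
import os


def convert_file(src_path, dst_path, src_encoding="shift_jis"):
    # Hypothetical helper: decode the source file and re-encode it as UTF-8.
    # newline="" leaves the original line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


def convert_directory(src_dir, dst_dir):
    # Hypothetical helper: convert every .txt file in src_dir and write the
    # results to dst_dir, so the originals are never opened for writing.
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    for src_path in glob.glob(os.path.join(src_dir, "*.txt")):
        dst_path = os.path.join(dst_dir, os.path.basename(src_path))
        convert_file(src_path, dst_path)
```

Calling `convert_directory("input", "output")` would leave the Shift-JIS originals in place, which makes it easy to check the converted copies before deleting anything. A file that is not valid Shift-JIS should raise `UnicodeDecodeError` during the read rather than being silently mangled.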
tripleee
Ross Ridge