Python 3: dealing with stripping lines in binary mode

Question

With the help of SO members I got as far as the sample code below. The aim is simply to merge the text files from a given folder and its subfolders and store the output as master.txt. However, I occasionally get a traceback; it looks like an error is thrown while reading a file.

Based on the suggestions and some research, it seems a good idea either to convert the text files to a uniform Unicode encoding or to process them line by line, so that garbage characters and empty lines are stripped as each line is read.

import shutil
import os.path

root = 'C:\\Dropbox\\test\\'
files = [(path,f) for path,_,file_list in os.walk(root) for f in file_list]

with open('C:\\Dropbox\\Python\\master.txt','wb') as output:
    for path, f_name in files:
        with open(os.path.join(path, f_name), 'rb') as input:
            shutil.copyfileobj(input, output)
        output.write(b'\n') # insert extra newline 

with open('master.txt', 'r') as f:
  lines = f.readlines()
with open('master.txt', 'w') as f:
  f.write("".join(L for L in lines if L.strip()))

Traceback I get:

Traceback (most recent call last):
  File "C:\Dropbox\Python\master1.py", line 14, in <module>
    lines = f.readlines()
  File "C:\PYTHON32\LIB\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 8159: character maps to <undefined>

@ignacio-vazquez-abrams What would make the traceback disappear? – user1582596 Aug 31 '12 at 13:52

score 2 · Accepted Answer · answered Aug 31 '12 at 14:21

You've opened master.txt in text mode. When you then call readlines() on it, Python decodes the contents with the default encoding for your system (cp1252 here, according to the traceback). Apparently the file is in another encoding, as you get a UnicodeDecodeError.

Either open the file in binary mode, or specify the correct encoding.
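
A minimal sketch of both options follows. The function names are illustrative and not from the original post. The binary version assumes an ASCII-compatible encoding (it would not handle UTF-16, for example), and the 'utf-8' default in the text version is only an assumption; pass whatever encoding the files really use.

```python
def strip_blank_lines_binary(src_path, dst_path):
    """Copy src to dst, dropping empty and whitespace-only lines.

    Works on raw bytes, so nothing is decoded and no
    UnicodeDecodeError can be raised.
    """
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for line in src:        # a binary file yields bytes, split on b'\n'
            if line.strip():    # bytes.strip() removes ASCII whitespace
                dst.write(line)


def strip_blank_lines_text(src_path, dst_path, encoding='utf-8',
                           errors='strict'):
    """Same idea in text mode, with an explicit encoding.

    errors='replace' substitutes undecodable bytes instead of raising;
    the default 'strict' raises UnicodeDecodeError as in the question.
    """
    with open(src_path, 'r', encoding=encoding, errors=errors) as src, \
         open(dst_path, 'w', encoding=encoding, errors=errors) as dst:
        for line in src:
            if line.strip():
                dst.write(line)
```

Both functions write to a separate output file rather than rewriting master.txt in place, so the input survives if something goes wrong.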

Lennart Regebro

Thanks. My text files are simple in nature; sometimes I copy and paste, and that screws things up. I have no preference on encoding. Is there a quick way to clean up the text files so that they all have a unified encoding? Or perhaps the clean-up could be done while reading each line. – user1582596 Aug 31 '12 at 14:30
@user1582596: To have a unified encoding you will have to change the encoding of the files one by one. One way to do that would be to use the chardet library (http://pypi.python.org/pypi/chardet), which will guess the encoding. But before you do that: are you sure the files have different encodings? – Lennart Regebro Aug 31 '12 at 15:04
I used Notepad++ on Win7-64 to create these files without any explicit encoding settings, so the discrepancy probably comes from copy-paste. If I knew which file the traceback refers to, I could revisit it and try to fix it. As you can see, the traceback contains no reference to my text file, and that is why this became a show-stopper for me. – user1582596 Aug 31 '12 at 15:28
@user1582596: Just print the filename before reading from it. Easy. And you'll need to open the files in text mode as well, and not use copyfileobj(), but read from and write to the files yourself. – Lennart Regebro Aug 31 '12 at 15:52
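
The suggestion in the comment above could be sketched roughly as follows. The helper names and the 'utf-8' default are assumptions for illustration, not code from the thread; the sketch also assumes the output file lives outside the folder being walked, so it is not read back into itself.

```python
import os


def find_undecodable_files(root, encoding='utf-8'):
    """Return a list of (path, error) for files that fail to decode."""
    bad = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'r', encoding=encoding) as f:
                    for _ in f:     # read everything; decoding happens here
                        pass
            except UnicodeDecodeError as exc:
                bad.append((path, exc))
    return bad


def merge_text_files(root, out_path, encoding='utf-8'):
    """Merge all files under root into out_path, skipping blank lines."""
    with open(out_path, 'w', encoding=encoding) as out:
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                print(path)         # shows which file is being read if it fails
                with open(path, 'r', encoding=encoding) as f:
                    for line in f:
                        if line.strip():
                            # keep files apart even if the last line
                            # has no trailing newline
                            out.write(line if line.endswith('\n')
                                      else line + '\n')
```

Running `find_undecodable_files` first lists every problem file in one pass, instead of stopping at the first failure the way the merge does.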
"print (in_file)" did the trick: I found issues with two text files and fixed them manually. Many thanks for your time and your recent help. I also noticed http://python3porting.com and read a very decent review of it online. I am a Cisco engineer, but I will try to go through your book. Thanks again. – user1582596 Aug 31 '12 at 18:10
