I have the following code that searches through files using regular expressions and, if any matches are found, moves the file into a different directory.

import os
import gzip
import re
import shutil

def regEx1():
    os.chdir("C:/Users/David/myfiles")
    files = os.listdir(".")
    os.mkdir("C:/Users/David/NewFiles")
    regex_txt = input("Please enter the string your are looking for:")
    for x in (files):
        inputFile = open((x), "r")
        content = inputFile.read()
        inputFile.close()
        regex = re.compile(regex_txt, re.IGNORECASE)
        if re.search(regex, content)is not None:
            shutil.copy(x, "C:/Users/David/NewFiles")

When I run it I get the following error message:

Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Python33\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 367: character maps to <undefined>

Could someone please explain why this message appears?

4 Answers

In Python 3, when you open a file for reading in text mode (r), it decodes the contained text to Unicode.

Since you didn't specify what encoding to use to read the file, the platform default (from locale.getpreferredencoding) is being used, and that fails in this case.

You need to either specify an encoding that can decode the file contents, or open the file in binary mode instead (and use b'' bytes patterns for your regular expressions).

See the Python Unicode HOWTO for more information.
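
The two options can be sketched roughly as follows. This is only an illustration: the helper names `matches_text` and `matches_bytes` are hypothetical, and the default of UTF-8 is an assumption, since the question does not say which encoding the files actually use.

```python
import re


def matches_text(path, pattern, encoding="utf-8"):
    """Option 1: decode with an explicit encoding, then search the text."""
    # "utf-8" is only an assumed default; pass the file's real encoding.
    regex = re.compile(pattern, re.IGNORECASE)
    with open(path, "r", encoding=encoding) as handle:
        return regex.search(handle.read()) is not None


def matches_bytes(path, pattern):
    """Option 2: skip decoding and search the raw bytes."""
    # The pattern must be a bytes object (for example b"error"),
    # because the content read in "rb" mode is bytes, not str.
    regex = re.compile(pattern, re.IGNORECASE)
    with open(path, "rb") as handle:
        return regex.search(handle.read()) is not None
```

For example, `matches_text(x, regex_txt)` or `matches_bytes(x, regex_txt.encode("ascii"))` could replace the open/read/search lines in the loop. Keep in mind that with a bytes pattern, `re.IGNORECASE` only applies to ASCII letters.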

Martijn Pieters
  • Where would I add the "b''"? –  Jan 09 '13 at 17:10
  • @LWH91: Read the HOWTO to understand what that *means* first. – Martijn Pieters Jan 09 '13 at 17:11
  • Python 3 has `open(fname, mode, encoding='whatever')`, Python 2 has `codecs.open(fname, mode, encoding='whatever')` – Jochen Ritzel Jan 09 '13 at 17:12
  • @JochenRitzel: Let's focus this on Python 3; no need to confuse matters more for the OP. :-) – Martijn Pieters Jan 09 '13 at 17:15
  • @MartijnPieters I've read it and understand the problem, but I still don't know how to fix it. –  Jan 09 '13 at 17:26
  • As @MartijnPieters mentioned, amend the `inputFile = open(…)` line to specify file encoding as it clearly is not in `CP1252` (`UTF-8` perhaps?). – patrys Jan 09 '13 at 17:33
  • @LWH91: You need to figure out what encoding is used for that file. Sorry, I cannot do that for you. Once you figure that out, use `open(file, encoding='the_proper_file_encoding')`. – Martijn Pieters Jan 09 '13 at 18:44
  • @MartijnPieters any ideas how I could figure that out? –  Jan 09 '13 at 22:41
  • @LWH91: Lots of ways; the Python way would be to use [`chardet2`](http://pypi.python.org/pypi/chardet2) to make an educated guess. It won't be 100% foolproof in its detection though. – Martijn Pieters Jan 10 '13 at 07:13
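
If installing a detection library is not an option, a much cruder stdlib-only approach is to try a short list of candidate encodings and keep the first one that decodes without error. This is a different technique from statistical detection, and the sketch below is only a rough illustration: the function name `guess_encoding` and the candidate list are hypothetical, and a successful decode does not prove the guess is right.

```python
def guess_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes the file, else None."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
        return name
    return None
```

Order matters here: UTF-8 is strict, so it is tried first. A permissive encoding such as latin-1 accepts any byte sequence and would always "succeed", which is why it is left out of the default list.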

I'm not too familiar with Python 3.x, but the below may work.

inputFile = open((x, encoding="utf8"), "r")
Chris Hawkes
  • When I try that I get an error saying `SyntaxError: non-keyword arg after keyword arg` –  Jan 09 '13 at 22:27

There's a similar question here: Python: Traceback codecs.charmap_decode(input,self.errors,decoding_table)[0]

But you might want to try:

 open((x), "r", encoding='UTF8')
Community
ash kim

Thank you very much for this solution. It helped me with a different problem. I used:

exec (open ("DIP6.py").read ())

and I got this error because I have this symbol in a comment in DIP6.py:

 #       ● en première colonne

It works fine with:

exec (open ("DIP6.py", encoding="utf8").read ())

It also solves a problem with, for example:

print("été")

in DIP6.py.

Without the encoding argument, I got:

été

in the console.

Thank you :-) .

Haris
sjlouis