I'm running into issues when processing mail with procmail and Python. My procmail recipe looks something like this:

:0
...[Filter] | (python3 script.py) >> file.txt

My Python script reads the mail from stdin, decodes the MIME content to Unicode, and writes it to a file as follows:

import quopri
import sys
from email.parser import Parser

def main():
    dataset = Data()
    indata = Parser().parse(sys.stdin).as_string()

    indata = quopri.decodestring(indata).decode('utf-8')

    arrayofstrings = indata.split("\n")

    for line in arrayofstrings:
        [write some data to <dataset>]
    filename = "outfile.txt"
    file = open(filename, "w")
    file.write(dataset.toString())

Data() is a structure that stores a series of unicode strings and toString() concatenates them.

If I run this script in bash with a stored mail like this:

cat test.txt | python3 script.py

it correctly writes the data as unicode to the file.

However, if I get a mail and it gets processed, procmail writes the following error to the log:

UnicodeEncodeError: 'ascii' codec can't encode character '\xdf' in position 83: ordinal not in range(128)

If I change the last line of the python script to:

file.write(dataset.toString().encode('utf-8'))

I get the correctly encoded string in the file. I want it in unicode though.

halfer
Carl Philipp
  • What's the output of `echo $LANG` in your terminal (or the terminal of the user that runs the job, if different)? – snakecharmerb Feb 26 '19 at 21:49
  • "I want it in Unicode though" is incorrect, there is no way to write Unicode to a file which does not also involve encoding it, and UTF-8 is the default Unicode encoding most places these days. – tripleee Feb 27 '19 at 07:58
  • As @snakecharmerb wrote, this is likely an issue with the locale. Check `set | egrep '^LC_|^LANG'` and set it accordingly for procmail and/or your Python script. – xebeche Feb 27 '19 at 09:29
  • Possible duplicate of https://stackoverflow.com/questions/38882721/python3-unicodeencodeerror-ascii-codec-cant-encode-character-xfc – tripleee Feb 27 '19 at 10:02

1 Answer

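The immediate problem is that Python picks its default text encoding from the environment it runs in (in practice, the locale settings such as LANG and LC_ALL). In your interactive shell these typically select UTF-8; but of course, when you run the script from Procmail, it is not connected to a terminal and typically has no such locale configured, so Python falls back to ASCII.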
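
To see which encodings Python actually picked in each environment, a small diagnostic along these lines may help. This is only a sketch: `describe_encodings` is a made-up helper name, and it assumes that stderr from the script ends up in your Procmail log (as the traceback in the question did).

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python selected for this process."""
    return {
        # Default used by open() in text mode when no encoding= is given
        "preferred": locale.getpreferredencoding(False),
        # Encoding used by print() and sys.stdout.write()
        "stdout": sys.stdout.encoding,
        # Encoding used for file names
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    # Report on stderr so the values show up next to any traceback
    for name, value in describe_encodings().items():
        print(f"{name}: {value}", file=sys.stderr)
```

Run it once from your shell and once from a Procmail recipe; if the values differ, the locale is the likely culprit.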

The workaround might involve setting PYTHONIOENCODING in your Procmail file:

:0
...[Filter] | PYTHONIOENCODING=utf-8 python3 script.py >> file.txt

(Note also that no parentheses are required; they would run Python in a subshell, but there seems to be no reason to want it to run in a subshell here.)
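
If you want to check the effect of the variable without waiting for a mail to arrive, something like the following could be used. It is a sketch under a few assumptions: `run_child` is a hypothetical helper, `LC_ALL=C` only roughly imitates the bare environment Procmail provides, and a POSIX system is assumed.

```python
import os
import subprocess
import sys


def run_child(extra_env):
    """Run a child Python that prints 'ß' and return its raw stdout bytes."""
    env = dict(os.environ, **extra_env)
    result = subprocess.run(
        [sys.executable, "-c", "print('\\xdf')"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # With PYTHONIOENCODING set, stdout is encoded as UTF-8 even
    # though the locale is forced to "C".
    raw = run_child({"LC_ALL": "C", "PYTHONIOENCODING": "utf-8"})
    print(raw)
```

Keep in mind that PYTHONIOENCODING only applies to stdin, stdout and stderr. A file opened with open() still uses the locale's encoding unless you pass encoding= explicitly.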

Notice, though, that this is not functionally different from specifically encoding the output into UTF-8 within your script (though I can see how you might want to avoid hardcoding it). Text in a file cannot be "Unicode" without also being serialized into an encoding. (And if you don't want UTF-8, you have to specify what you do want ... UTF-16le maybe? That would be compatible with Java and legacy Windows.)
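
As an illustration of that last point, here is a minimal sketch which names the encoding when opening the file, so the result no longer depends on the locale. `write_text` is a hypothetical helper, not something from your script; since your script writes through open() rather than stdout, this may well be the more direct fix.

```python
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Write text with an explicit encoding instead of the locale default."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)


if __name__ == "__main__":
    sample = "Straße"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "outfile.txt")
        # The same text becomes different bytes under each encoding
        for enc in ("utf-8", "utf-16-le"):
            write_text(path, sample, encoding=enc)
            with open(path, "rb") as handle:
                print(enc, handle.read())
```

The text is the same in both cases; only the bytes on disk differ, which is why "Unicode" alone does not say what the file should contain.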

tripleee